Semarchy Text enricher

The Semarchy Text enricher applies normalization, transliteration and phonetic transformations to text strings.

Plug-in ID

Semarchy Text Enricher - com.semarchy.engine.plugins.convergence.text

Description

This enricher applies normalization, transliteration and phonetic transformations to text strings. It takes an Input Text and applies an Input Filter to this text, for example to remove all characters but letters. Then it applies a series of transformations defined in the Transformation parameter and returns a Transformed Text.

This plug-in is thread-safe and supports parallel execution.

Plug-in parameters

The following table lists the plug-in parameters.

Parameter name Mandatory Type Description

Input Filter

No

String

Filter applied to the input text before the transformations. Valid values are: NONE, which applies no filter; LETTERS, which removes all non-letter characters from the input string; and STANDARD, which tokenizes the input text by splitting it into words.

Transformation

Yes

String

A pipe-separated sequence of transformation definitions. Transformations include:

  • NORMALIZE

  • TRANSLITERATE [<Id>]

  • PHONETIC <Type> [<MaxCodeLength>]

  • BEIDERMORSE [Split] [RuleType] [MaxPhonemes] [NameType]

  • DOUBLEMETAPHONE [<max_code_length>] [split]

See the Transformations section for a detailed description of each transformation.

Synonyms Separator

No

String

Separator used between the synonyms returned by the enricher. Default value is a pipe (|).

Plug-in inputs

The following table lists the plug-in inputs.

Input name Mandatory Type Description

Input Text

Yes

String

Text to transform.

Plug-in outputs

The following table lists the plug-in outputs.

Output name Type Description

Transformed Text

String

Filtered and transformed text.

Secondary Transformed Text

String

Secondary transformed text. This output may contain the secondary result of a BEIDERMORSE or DOUBLEMETAPHONE transformation. See Other transformations for more information.

Input filters

The following input filters are supported by the enricher:

  • NONE: No filter is applied to the input text.

  • LETTERS: Removes all non-letter characters from the input string.

  • STANDARD: Breaks words in the input text according to the rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
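The three filters above can be approximated with a short Python sketch. This is an illustration only, not the plug-in's implementation: the function name `apply_input_filter` is hypothetical, the STANDARD branch uses a simple regular expression as a rough stand-in for Unicode Text Segmentation (UAX #29 treats some punctuation, such as an apostrophe inside a word, differently), and rejoining the tokens with a single space is an assumption.

```python
import re


def apply_input_filter(text: str, input_filter: str) -> str:
    """Hypothetical helper that mimics the documented input filters."""
    name = input_filter.upper()
    if name == "NONE":
        # No filter: the text is passed through unchanged.
        return text
    if name == "LETTERS":
        # Keep only characters that Unicode classifies as letters.
        return "".join(ch for ch in text if ch.isalpha())
    if name == "STANDARD":
        # Rough approximation of word breaking: runs of word characters.
        # Assumption: tokens are rejoined with a single space.
        return " ".join(re.findall(r"\w+", text))
    raise ValueError(f"Unknown input filter: {input_filter}")


print(apply_input_filter("Jean-Luc  Picard, 2nd", "LETTERS"))
print(apply_input_filter("Jean-Luc  Picard, 2nd", "STANDARD"))
```

With LETTERS, spaces, digits and punctuation are all dropped, so separate words run together. If word boundaries matter for the later transformations, STANDARD is likely the more suitable filter.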

Transformations

The enricher supports the following transformation definitions:

  • Normalization

  • Phonetic Transformation

    • PHONETIC [SOUNDEX | REFINEDSOUNDEX | METAPHONE [<max_code_length>] | DOUBLEMETAPHONE [<max_code_length>] | CAVERPHONE | CAVERPHONE1 | NYSIIS | MRA | COLOGNE | BEIDERMORSE]: Applies a phonetic transformation.

  • Other Transformations

    • BEIDERMORSE [Split] [RuleType] [MaxPhonemes] [NameType]

    • DOUBLEMETAPHONE [<max_code_length>] [split]

  • Transliteration

    • TRANSLITERATE [<ID>]: Applies a transliteration to the string. The transliteration is identified by an ID. If no ID is provided, the Any-Latin transliteration is used.

Transformations can be sequenced. Successive transformations are separated by a pipe (|) sign.

Examples of transformations:

  • Normalize and apply Phonetic Soundex: NORMALIZE | PHONETIC SOUNDEX

  • Normalize and then transliterate to Latin script: NORMALIZE | TRANSLITERATE Any-Latin

  • Normalize, transliterate to Latin script and then apply Metaphone with a maximum resulting length of 5 characters: NORMALIZE | TRANSLITERATE Any-Latin | PHONETIC METAPHONE 5

  • Perform an approximate BEIDERMORSE transformation of family names, using the generic name type: BEIDERMORSE APPROX 10 FALSE GENERIC
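To make the sequencing syntax concrete, the following Python sketch splits a Transformation value into ordered steps. It is a minimal illustration under assumptions: the function name `parse_transformations` is hypothetical, whitespace around the pipe is assumed to be insignificant, and the arguments of each step are kept as raw strings without being validated.

```python
from typing import List, Tuple


def parse_transformations(spec: str) -> List[Tuple[str, List[str]]]:
    """Split a pipe-separated definition into (keyword, arguments) steps."""
    steps = []
    for chunk in spec.split("|"):
        tokens = chunk.split()
        if not tokens:
            raise ValueError("Empty transformation in sequence")
        # The first token is the transformation keyword, the rest are arguments.
        steps.append((tokens[0].upper(), tokens[1:]))
    return steps


for keyword, args in parse_transformations(
    "NORMALIZE | TRANSLITERATE Any-Latin | PHONETIC METAPHONE 5"
):
    print(keyword, args)
```

Each step would then be applied in order, with the output of one step becoming the input of the next.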

Normalization

The NORMALIZE transformation applies a series of character foldings that map similar characters to a common target, so that certain distinctions between similar characters are ignored. This includes accent removal, case folding, etc.

Examples of normalization:

| Original Text | Normalized Text | Comments |
|---|---|---|
| ‒ – — ― | - - - - | Dashes folding: four different dashes converted to the same dash |
| AbSoLuteLy TRUE | absolutely true | Case folding |
| … | ... | Ellipsis character converted to three dots |
| ½ Tsp | 1/2 tsp | Symbol folding |
| Æsop | aesop | |
| Äsop | asop | |
| Dürst | durst | |
| Encyclopædia | encyclopaedia | |
| œuvre | oeuvre | |
| poſt | post | |
| résumé français | resume francais | Accent removal and case folding |
| Straße | strasse | |
| ٣ is a magic number | 3 is a magic number | Native digit folding |

The complete list of transformations is given below:

Accent removal

Hebrew Alternates folding

Overline folding

Suzhou Numeral folding

Case folding

Jamo folding

Positional forms folding

Symbol folding

Canonical duplicates folding

Letterforms folding

Small forms folding

Underline folding

Dashes folding

Math symbol folding

Space folding

Vertical forms folding

Diacritic removal (including stroke, hook, descender)

Multigraph Expansions: All

Spacing Accents folding

Width folding

Greek letterforms folding

Native digit folding

Subscript folding

Han Radical folding

For more information about these transformations, see Unicode Technical Report #30 (UTR #30), Character Foldings.
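A few of these foldings can be approximated with Python's standard `unicodedata` module. The sketch below is not the plug-in's implementation and covers only a subset: accent removal, case folding, dashes folding, native digit folding and the compatibility mappings provided by NFKD. The function name `fold` is hypothetical. Other foldings, such as the expansion of Æ to ae, are not reproduced by this sketch.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate a subset of the character foldings listed above."""
    # NFKD separates base letters from combining accents and applies
    # compatibility mappings (for example the ellipsis and the long s).
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category == "Mn":
            # Non-spacing mark: drop it (accent removal).
            continue
        if category == "Pd":
            # Any dash punctuation becomes a plain hyphen (dashes folding).
            out.append("-")
        elif category == "Nd":
            # Any decimal digit becomes its ASCII digit (native digit folding).
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    # Case folding also expands the German sharp s to "ss".
    return "".join(out).casefold()


for sample in ["résumé français", "Straße", "AbSoLuteLy TRUE", "Dürst"]:
    print(sample, "->", fold(sample))
```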

Phonetic transformations

A phonetic transformation converts the string into a code that represents its pronunciation. The default phonetic transformation is PHONETIC METAPHONE.

Phonetic transformations include:

  • PHONETIC SOUNDEX and PHONETIC REFINEDSOUNDEX: Phonetic algorithms for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

  • PHONETIC METAPHONE and PHONETIC DOUBLEMETAPHONE: Algorithms for indexing words by their English pronunciation. They are suitable for most English words, not just names. Double Metaphone can return both a primary and a secondary code for an input string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. Both algorithms support a Max Code Length parameter, which defines the maximum length of the encoded result and defaults to 4.

  • PHONETIC CAVERPHONE and PHONETIC CAVERPHONE1: Algorithms designed for matching data in electoral rolls, optimized for accents present in parts of New Zealand.

  • PHONETIC NYSIIS: The New York State Identification and Intelligence System (NYSIIS) algorithm, which maps similar phonemes to the same letter. The result is a string that can be pronounced by the reader without decoding.

  • PHONETIC MRA: The Match Rating Approach developed by Western Airlines. This algorithm combines an encoding technique with a range comparison technique.

  • PHONETIC COLOGNE: Phonetic algorithm optimized for the German language (Kölner Phonetik).

  • PHONETIC BEIDERMORSE: Phonetic algorithm that supports greater accuracy in matching Slavic and Yiddish surnames with similar pronunciation but differences in spelling. It returns a list of tokens, separated by the string specified in the Synonyms Separator parameter: first the transformed input text, then the transformed synonyms of the input text.
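To illustrate how a phonetic algorithm lets differently spelled names share one code, here is a minimal Python sketch of the classic American Soundex rules. It is an illustration under assumptions, not the code used by the plug-in, whose implementation may differ in edge cases (for example non-ASCII letters, which this sketch simply ignores).

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter followed by three digits."""
    codes = {}
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    ):
        for letter in letters:
            codes[letter] = digit

    # Keep ASCII letters only; everything else is ignored in this sketch.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W do not separate two letters with the same code.
            continue
        code = codes.get(ch, "")
        if code and code != previous:
            result += code
        # Vowels have no code, so they reset the "previous" code.
        previous = code
    # Pad with zeros and truncate to four characters.
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Both names print the same code, which is what allows them to be matched despite the spelling difference.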

Other transformations

These other transformations return a list of tokens which can be split into the Transformed Text and Secondary Transformed Text outputs.

These transformations should preferably be used at the end of the transformation sequence, because their secondary transformed text is not processed by subsequent transformations in the sequence.

Other transformations include:

  • BEIDERMORSE [<split>] [<rule_type>] [<max_phonems>] [<name_type>]: The Beidermorse transformation returns a list of tokens: first the transformed input text, then the transformed synonyms of the input text. Beidermorse supports the following parameters:

    • split: If this parameter is set to true, all synonyms after the first one are concatenated in the Secondary Transformed Text output. If it is set to false (default value), all synonyms are appended to the first token in the Transformed Text output.

    • rule_type: EXACT for an exact phonetic transformation, or APPROX for an approximate one.

    • max_phonems: The maximum number of synonyms returned. Default is 20.

    • name_type: Defaults to GENERIC. Use ASHKENAZI or SEPHARDIC for phonetic encodings optimized for Ashkenazi or Sephardic Jewish family names.

  • DOUBLEMETAPHONE [<max_code_length>] [<split>]: This transformation encodes the input string with the Double Metaphone algorithm and returns a primary code and a secondary code. If split is set to true, the secondary code is pushed to the Secondary Transformed Text output. Otherwise, it is concatenated to the primary code in the Transformed Text output.
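The effect of the split parameter on the two outputs can be sketched as follows. This Python snippet is only an illustration of the behavior described above: the function name `route_tokens` is hypothetical, the sample codes are invented, and joining tokens with the Synonyms Separator (a pipe by default) and leaving the secondary output empty when split is false are assumptions.

```python
from typing import List, Tuple


def route_tokens(tokens: List[str], split: bool, separator: str = "|") -> Tuple[str, str]:
    """Return (Transformed Text, Secondary Transformed Text) for a token list."""
    if not tokens:
        return "", ""
    if split:
        # First token is the primary output; the others go to the secondary output.
        return tokens[0], separator.join(tokens[1:])
    # Without split, every token is appended to the primary output.
    return separator.join(tokens), ""


# Invented sample codes, for illustration only.
print(route_tokens(["SM0", "XMT"], split=True))
print(route_tokens(["SM0", "XMT"], split=False))
```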

Transliteration

The TRANSLITERATE transformation converts text from one script to another, for example Traditional to Simplified Chinese, Japanese Hiragana to Katakana, or Cyrillic to Latin script.

Each source/target transliteration is identified by an ID that names the source and the target script or language, for example Katakana-Latin or Latin-Thai. The special tag Any stands for any script or language: Any-Latin converts any input script to Latin script. If no ID is provided, the Any-Latin transliteration is used.

The supported transliteration IDs are listed below.

Accents-Any

Any-Name

Devanagari-Bengali

Han-Latin

Latin-Greek

Pinyin-NumericPinyin

Amharic-Latin/BGN

Any-NFC

Devanagari-Gujarati

Han-Latin/Names

Latin-Greek/UNGEGN

pl_FONIPA-ja

Any-Accents

Any-NFD

Devanagari-Gurmukhi

Hangul-Latin

Latin-Gujarati

pl-ja

Any-am

Any-NFKC

Devanagari-Kannada

Hans-Hant

Latin-Gurmukhi

pl-pl_FONIPA

Any-Arabic

Any-NFKD

Devanagari-Latin

Hant-Hans

Latin-Han

Publishing-Any

Any-Armenian

Any-Null

Devanagari-Malayalam

Hebrew-Latin

Latin-Hangul

ro_FONIPA-ja

Any-Bengali

Any-Oriya

Devanagari-Oriya

Hebrew-Latin/BGN

Latin-Hebrew

ro-ja

Any-Bopomofo

Any-pl_FONIPA

Devanagari-Tamil

Hex-Any

Latin-Hiragana

ro-ro_FONIPA

Any-CaseFold

Any-Publishing

Devanagari-Telugu

Hex-Any/C

Latin-Jamo

ru-ja

Any-cs_FONIPA

Any-Remove

Digit-Tone

Hex-Any/Java

Latin-Kannada

ru-zh

Any-Cyrillic

Any-ro_FONIPA

es_419-ja

Hex-Any/Perl

Latin-Katakana

Russian-Latin/BGN

Any-Devanagari

Any-ru

es_419-zh

Hex-Any/Unicode

Latin-Malayalam

Serbian-Latin/BGN

Any-es_419_FONIPA

Any-sk_FONIPA

es_FONIPA-am

Hex-Any/XML

Latin-NumericPinyin

Simplified-Traditional

Any-es_FONIPA

Any-Syriac

es_FONIPA-es_419_FONIPA

Hex-Any/XML10

Latin-Oriya

sk_FONIPA-ja

Any-FCC

Any-Tamil

es_FONIPA-ja

Hiragana-Katakana

Latin-Syriac

sk-ja

Any-FCD

Any-Telugu

es_FONIPA-zh

Hiragana-Latin

Latin-Tamil

sk-sk_FONIPA

Any-Georgian

Any-Thaana

es-am

IPA-XSampa

Latin-Telugu

Syriac-Latin

Any-Greek

Any-Thai

es-es_FONIPA

it-am

Latin-Thaana

Tamil-Bengali

Any-Greek/UNGEGN

Any-Title

es-ja

it-ja

Latin-Thai

Tamil-Devanagari

Any-Gujarati

Any-Upper

es-zh

ja_Latn-ko

Macedonian-Latin/BGN

Tamil-Gujarati

Any-Gurmukhi

Any-zh

Fullwidth-Halfwidth

ja_Latn-ru

Malayalam-Bengali

Tamil-Gurmukhi

Any-Han

Arabic-Latin

Georgian-Latin

Jamo-Latin

Malayalam-Devanagari

Tamil-Kannada

Any-Hangul

Arabic-Latin/BGN

Georgian-Latin/BGN

JapaneseKana-Latin/BGN

Malayalam-Gujarati

Tamil-Latin

Any-Hans

Armenian-Latin

Greek-Latin

Kannada-Bengali

Malayalam-Gurmukhi

Tamil-Malayalam

Any-Hant

Armenian-Latin/BGN

Greek-Latin/BGN

Kannada-Devanagari

Malayalam-Kannada

Tamil-Oriya

Any-Hebrew

ASCII-Latin

Greek-Latin/UNGEGN

Kannada-Gujarati

Malayalam-Latin

Tamil-Telugu

Any-Hex

Azerbaijani-Latin/BGN

Gujarati-Bengali

Kannada-Gurmukhi

Malayalam-Oriya

Telugu-Bengali

Any-Hex/C

Belarusian-Latin/BGN

Gujarati-Devanagari

Kannada-Latin

Malayalam-Tamil

Telugu-Devanagari

Any-Hex/Java

Bengali-Devanagari

Gujarati-Gurmukhi

Kannada-Malayalam

Malayalam-Telugu

Telugu-Gujarati

Any-Hex/Perl

Bengali-Gujarati

Gujarati-Kannada

Kannada-Oriya

Maldivian-Latin/BGN

Telugu-Gurmukhi

Any-Hex/Plain

Bengali-Gurmukhi

Gujarati-Latin

Kannada-Tamil

Mongolian-Latin/BGN

Telugu-Kannada

Any-Hex/Unicode

Bengali-Kannada

Gujarati-Malayalam

Kannada-Telugu

Name-Any

Telugu-Latin

Any-Hex/XML

Bengali-Latin

Gujarati-Oriya

Katakana-Hiragana

NumericPinyin-Latin

Telugu-Malayalam

Any-Hex/XML10

Bengali-Malayalam

Gujarati-Tamil

Katakana-Latin

NumericPinyin-Pinyin

Telugu-Oriya

Any-Hiragana

Bengali-Oriya

Gujarati-Telugu

Kazakh-Latin/BGN

Oriya-Bengali

Telugu-Tamil

Any-ja

Bengali-Tamil

Gurmukhi-Bengali

Kirghiz-Latin/BGN

Oriya-Devanagari

Thaana-Latin

Any-Kannada

Bengali-Telugu

Gurmukhi-Devanagari

Korean-Latin/BGN

Oriya-Gujarati

Thai-Latin

Any-Katakana

Bopomofo-Latin

Gurmukhi-Gujarati

Latin-Arabic

Oriya-Gurmukhi

Tone-Digit

Any-ko

Bulgarian-Latin/BGN

Gurmukhi-Kannada

Latin-Armenian

Oriya-Kannada

Traditional-Simplified

Any-Latin (default)

cs_FONIPA-ja

Gurmukhi-Latin

Latin-ASCII

Oriya-Latin

Turkmen-Latin/BGN

Any-Latin/BGN

cs_FONIPA-ko

Gurmukhi-Malayalam

Latin-Bengali

Oriya-Malayalam

Ukrainian-Latin/BGN

Any-Latin/Names

cs-cs_FONIPA

Gurmukhi-Oriya

Latin-Bopomofo

Oriya-Tamil

Uzbek-Latin/BGN

Any-Latin/UNGEGN

cs-ja

Gurmukhi-Tamil

Latin-Cyrillic

Oriya-Telugu

XSampa-IPA

Any-Lower

cs-ko

Gurmukhi-Telugu

Latin-Devanagari

Pashto-Latin/BGN

zh_Latn_PINYIN-ru

Any-Malayalam

Cyrillic-Latin

Halfwidth-Fullwidth

Latin-Georgian

Persian-Latin/BGN
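The IDs listed above share a common shape: a source, a hyphen, a target, and sometimes a suffix after a slash (such as BGN or UNGEGN). The Python sketch below splits an ID into these parts. It assumes that the suffix is a variant in the sense of the ICU transliterator ID convention (source-target/variant); the names `TransliterationId` and `parse_transliteration_id` are hypothetical, and the sketch does not perform any transliteration.

```python
from typing import NamedTuple, Optional


class TransliterationId(NamedTuple):
    source: str
    target: str
    variant: Optional[str]


def parse_transliteration_id(identifier: str = "Any-Latin") -> TransliterationId:
    """Split an ID such as 'Greek-Latin/UNGEGN' into its parts."""
    source, hyphen, rest = identifier.partition("-")
    if not hyphen or not source or not rest:
        raise ValueError(f"Not a source-target ID: {identifier}")
    # Assumption: anything after the slash is a variant of the transliteration.
    target, _, variant = rest.partition("/")
    return TransliterationId(source, target, variant or None)


print(parse_transliteration_id("Greek-Latin/UNGEGN"))
print(parse_transliteration_id())  # falls back to the documented default
```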