Text Normalization and Transliteration

This plug-in applies normalization, transliteration, and phonetic transformations to text strings.

Convergence Text Enricher

Plug-in ID

Convergence Text Enricher - com.semarchy.engine.plugins.convergence.text

Description

This enricher applies normalization, transliteration, and phonetic transformations to text strings. It takes an Input Text and first applies an Input Filter to it, for example to remove all characters except letters. It then applies the sequence of transformations defined in the Transformation parameter and returns the Transformed Text.
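
The processing order described above (filter first, then each transformation in turn) can be pictured with a minimal Python sketch. It is not the plug-in's implementation: `enrich`, `letters_only`, and the use of `str.lower` as a stand-in transformation are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable


def enrich(input_text: str,
           input_filter: Callable[[str], str],
           transformations: Iterable[Callable[[str], str]]) -> str:
    """Apply the input filter first, then each transformation in order."""
    text = input_filter(input_text)
    for transform in transformations:
        text = transform(text)
    return text


def letters_only(text: str) -> str:
    """Hypothetical filter: keep letters, drop every other character."""
    return "".join(ch for ch in text if ch.isalpha())


# str.lower is only a stand-in for a real transformation such as NORMALIZE.
transformed_text = enrich("O'Neil-Smith 42", letters_only, [str.lower])
print(transformed_text)  # oneilsmith
```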

Plug-in Parameters

The following table lists the plug-in parameters.

Parameter Name | Mandatory | Type | Description
Input Filter | No | String | Filter applied to the input text before the transformation. Valid values are: NONE, which applies no filter; LETTERS, which removes all non-letter characters from the input string; and STANDARD, which tokenizes the input text by splitting it into words.
Transformation | Yes | String | A pipe-separated sequence of transformation definitions. Transformations include: NORMALIZE, TRANSLITERATE [<Id>] and PHONETIC <Type> [<MaxCodeLength>]. See the Transformations section for a detailed description of each transformation.

Plug-in Inputs

The following table lists the plug-in inputs.

Parameter Name | Mandatory | Type | Description
Input Text | Yes | String | Text to transform.

Plug-in Outputs

The following table lists the plug-in outputs.

Parameter Name | Mandatory | Type | Description
Transformed Text | Yes | String | Filtered and transformed text.
Secondary Transformed Text | Yes | String | Secondary transformed text. This output may contain the result of a Beider-Morse or Double Metaphone transformation. See Other Transformations for more information.

Input Filters

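The enricher supports the following input filters: NONE, LETTERS, and STANDARD. Their behavior is described under the Input Filter parameter in Plug-in Parameters.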
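
As a rough illustration of the three filter values, the sketch below mimics them in Python. The function name `apply_input_filter` is hypothetical, and two behaviors are assumptions rather than documented facts: LETTERS is taken literally (spaces are removed along with other non-letters), and STANDARD is assumed to return the words re-joined with single spaces.

```python
import re


def apply_input_filter(text: str, filter_name: str) -> str:
    """Mimic the NONE, LETTERS and STANDARD input filters (illustrative only)."""
    name = filter_name.upper()
    if name == "NONE":
        # No filtering: the text is passed through unchanged.
        return text
    if name == "LETTERS":
        # Remove every character that is not a letter (including spaces).
        return "".join(ch for ch in text if ch.isalpha())
    if name == "STANDARD":
        # Assumption: a word is a run of word characters; the words found
        # are re-joined with a single space.
        return " ".join(re.findall(r"\w+", text))
    raise ValueError(f"Unknown input filter: {filter_name}")


print(apply_input_filter("Jean-Luc  Picard, Jr.", "LETTERS"))   # JeanLucPicardJr
print(apply_input_filter("Jean-Luc  Picard, Jr.", "STANDARD"))  # Jean Luc Picard Jr
```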

Transformations

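The enricher supports the following transformation definitions: NORMALIZE, TRANSLITERATE [<Id>] and PHONETIC <Type> [<MaxCodeLength>]. Each one is described in the sections below.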

Transformations can be chained. Successive transformations are separated by a pipe (“|”) character and are applied in the order in which they are written.
Examples of transformations:
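
The following sketch is an illustration of the pipe syntax only and is not taken from the product. It splits a hypothetical Transformation parameter value into its individual definitions. The names `TransformationDef` and `parse_transformations` are invented for this example, the value `4` for the maximum code length is arbitrary, and tolerating whitespace around the pipe is an assumption.

```python
from typing import List, NamedTuple


class TransformationDef(NamedTuple):
    name: str        # e.g. NORMALIZE, TRANSLITERATE, PHONETIC
    args: List[str]  # optional arguments such as an ID, a type, a length


def parse_transformations(parameter: str) -> List[TransformationDef]:
    """Split a pipe-separated Transformation value into definitions."""
    definitions = []
    for part in parameter.split("|"):
        tokens = part.split()
        if not tokens:
            raise ValueError("Empty transformation definition")
        definitions.append(TransformationDef(tokens[0].upper(), tokens[1:]))
    return definitions


# Hypothetical parameter value built from the syntax given in this page.
steps = parse_transformations("TRANSLITERATE Any-Latin|NORMALIZE|PHONETIC METAPHONE 4")
for step in steps:
    print(step.name, step.args)
# TRANSLITERATE ['Any-Latin']
# NORMALIZE []
# PHONETIC ['METAPHONE', '4']
```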

Normalization

The NORMALIZE transformation applies a series of character foldings to the string. Each folding maps similar characters to a common target so that certain distinctions between them are ignored. The foldings include accent removal, case folding, and others.
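
To give a feel for what such foldings do, here is a minimal Python sketch covering only a small subset (accent removal, compatibility forms such as ligatures and full-width letters, and case folding). It relies on the standard `unicodedata` module and is an approximation for illustration, not the plug-in's NORMALIZE implementation. For instance, letters with a stroke such as "ø" are left untouched by this sketch.

```python
import unicodedata


def normalize(text: str) -> str:
    """Approximate a few foldings: accents, compatibility forms, case."""
    # NFKD splits accented letters into base letter + combining mark and
    # replaces compatibility characters (ligatures, full-width forms).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left behind by the decomposition.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Case folding removes upper/lower case distinctions.
    return without_marks.casefold()


print(normalize("Crème Brûlée"))  # creme brulee
```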

Example of transformations:

The complete list of character foldings applied by NORMALIZE is given below:

Accent removal
Case folding
Canonical duplicates folding
Dashes folding
Diacritic removal (including stroke, hook, descender)
Greek letterforms folding
Han Radical folding
Hebrew Alternates folding
Jamo folding
Letterforms folding
Math symbol folding
Multigraph Expansions: All
Native digit folding
No-break folding
Overline folding
Positional forms folding
Small forms folding
Space folding
Spacing Accents folding
Subscript folding
Superscript folding
Suzhou Numeral folding
Symbol folding
Underline folding
Vertical forms folding
Width folding

For more information about these foldings, see Unicode Technical Report #30 (UTR #30), Character Foldings.

Phonetic Transformations

A phonetic transformation converts the string into a code that represents its pronunciation, so that strings that sound alike produce the same or similar codes. The default phonetic transformation is PHONETIC METAPHONE.
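
The sketch below shows the general idea of a phonetic code using American Soundex, chosen here only because it is compact. It is not Metaphone, and this page does not state whether the plug-in offers Soundex. The `max_code_length` argument is a hypothetical counterpart of the `<MaxCodeLength>` option and is assumed to cap (and zero-pad) the length of the code.

```python
# Letter groups that share a Soundex digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str, max_code_length: int = 4) -> str:
    """Return the American Soundex code of a word (ASCII letters only)."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Emit a digit unless it repeats the previous consonant group.
        if digit and digit != previous:
            code += digit
        # H and W do not separate two consonants of the same group.
        if ch not in "HW":
            previous = digit
    return code.ljust(max_code_length, "0")[:max_code_length]


print(soundex("Robert"))  # R163
print(soundex("Rupert"))  # R163
```

Both spellings map to the same code, which is what makes phonetic codes useful for matching names that sound alike.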

Phonetic transformations include:

Other Transformations

These other transformations return a list of tokens which can be split into the Transformed Text and Secondary Transformed Text outputs.

Note: These transformations should preferably be used at the end of the transformation sequence, because their secondary transformed text is not processed by subsequent transformations in the sequence.
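
The effect described in the note can be pictured with the following sketch. Everything in it is hypothetical: `run_sequence`, `two_codes`, and `upper` are invented stand-ins, and the way the secondary text is carried along is an assumption made to match the note, not a description of the plug-in's internals.

```python
from typing import Callable, Iterable, Optional, Tuple

# A step returns (primary text, secondary text or None).
Step = Callable[[str], Tuple[str, Optional[str]]]


def run_sequence(text: str, steps: Iterable[Step]) -> Tuple[str, Optional[str]]:
    """Chain steps; only the primary text is passed to the next step."""
    secondary: Optional[str] = None
    for step in steps:
        text, step_secondary = step(text)
        if step_secondary is not None:
            # Kept as-is: later steps never see the secondary text.
            secondary = step_secondary
    return text, secondary


def two_codes(text: str) -> Tuple[str, Optional[str]]:
    """Hypothetical stand-in for a transformation yielding two results."""
    return text + "1", text + "2"


def upper(text: str) -> Tuple[str, Optional[str]]:
    """Hypothetical single-output transformation."""
    return text.upper(), None


print(run_sequence("ab", [two_codes, upper]))  # ('AB1', 'ab2')
print(run_sequence("ab", [upper, two_codes]))  # ('AB1', 'AB2')
```

In the first call the secondary text misses the upper-casing step, which is why placing the two-output transformation last gives consistent results.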

Other transformations include:

Transliteration

The TRANSLITERATE transformation converts text from one character script to another, for example from Traditional to Simplified Chinese, from Japanese Hiragana to Katakana, or from Cyrillic to Latin script.
Each source/target transliteration is identified by an ID. The supported transliteration IDs are listed below. If no ID is provided, the Any-Latin transliteration is used.
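
As a small illustration of what a script-to-script conversion involves, the sketch below converts Hiragana to Katakana in Python. It uses the fact that the main Hiragana letters (U+3041 to U+3096) sit exactly 0x60 code points below their Katakana counterparts. It is a simplification written for this page, not the transliterator used by the plug-in, and the function name is hypothetical.

```python
HIRAGANA_START = 0x3041  # HIRAGANA LETTER SMALL A
HIRAGANA_END = 0x3096    # HIRAGANA LETTER SMALL KE
KATAKANA_OFFSET = 0x60   # distance between the two Unicode blocks


def hiragana_to_katakana(text: str) -> str:
    """Shift Hiragana letters to Katakana; leave other characters as-is."""
    return "".join(
        chr(ord(ch) + KATAKANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


print(hiragana_to_katakana("ひらがな"))  # ヒラガナ
```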

Each ID represents a transliteration from one script or language to another, for example Katakana-Latin or Latin-Thai. The special tag Any stands for any script or language. For example, Any-Latin converts any input script to Latin script.

Accents-Any Any-Name Devanagari-Bengali Han-Latin Latin-Greek Pinyin-NumericPinyin
Amharic-Latin/BGN Any-NFC Devanagari-Gujarati Han-Latin/Names Latin-Greek/UNGEGN pl_FONIPA-ja
Any-Accents Any-NFD Devanagari-Gurmukhi Hangul-Latin Latin-Gujarati pl-ja
Any-am Any-NFKC Devanagari-Kannada Hans-Hant Latin-Gurmukhi pl-pl_FONIPA
Any-Arabic Any-NFKD Devanagari-Latin Hant-Hans Latin-Han Publishing-Any
Any-Armenian Any-Null Devanagari-Malayalam Hebrew-Latin Latin-Hangul ro_FONIPA-ja
Any-Bengali Any-Oriya Devanagari-Oriya Hebrew-Latin/BGN Latin-Hebrew ro-ja
Any-Bopomofo Any-pl_FONIPA Devanagari-Tamil Hex-Any Latin-Hiragana ro-ro_FONIPA
Any-CaseFold Any-Publishing Devanagari-Telugu Hex-Any/C Latin-Jamo ru-ja
Any-cs_FONIPA Any-Remove Digit-Tone Hex-Any/Java Latin-Kannada ru-zh
Any-Cyrillic Any-ro_FONIPA es_419-ja Hex-Any/Perl Latin-Katakana Russian-Latin/BGN
Any-Devanagari Any-ru es_419-zh Hex-Any/Unicode Latin-Malayalam Serbian-Latin/BGN
Any-es_419_FONIPA Any-sk_FONIPA es_FONIPA-am Hex-Any/XML Latin-NumericPinyin Simplified-Traditional
Any-es_FONIPA Any-Syriac es_FONIPA-es_419_FONIPA Hex-Any/XML10 Latin-Oriya sk_FONIPA-ja
Any-FCC Any-Tamil es_FONIPA-ja Hiragana-Katakana Latin-Syriac sk-ja
Any-FCD Any-Telugu es_FONIPA-zh Hiragana-Latin Latin-Tamil sk-sk_FONIPA
Any-Georgian Any-Thaana es-am IPA-XSampa Latin-Telugu Syriac-Latin
Any-Greek Any-Thai es-es_FONIPA it-am Latin-Thaana Tamil-Bengali
Any-Greek/UNGEGN Any-Title es-ja it-ja Latin-Thai Tamil-Devanagari
Any-Gujarati Any-Upper es-zh ja_Latn-ko Macedonian-Latin/BGN Tamil-Gujarati
Any-Gurmukhi Any-zh Fullwidth-Halfwidth ja_Latn-ru Malayalam-Bengali Tamil-Gurmukhi
Any-Han Arabic-Latin Georgian-Latin Jamo-Latin Malayalam-Devanagari Tamil-Kannada
Any-Hangul Arabic-Latin/BGN Georgian-Latin/BGN JapaneseKana-Latin/BGN Malayalam-Gujarati Tamil-Latin
Any-Hans Armenian-Latin Greek-Latin Kannada-Bengali Malayalam-Gurmukhi Tamil-Malayalam
Any-Hant Armenian-Latin/BGN Greek-Latin/BGN Kannada-Devanagari Malayalam-Kannada Tamil-Oriya
Any-Hebrew ASCII-Latin Greek-Latin/UNGEGN Kannada-Gujarati Malayalam-Latin Tamil-Telugu
Any-Hex Azerbaijani-Latin/BGN Gujarati-Bengali Kannada-Gurmukhi Malayalam-Oriya Telugu-Bengali
Any-Hex/C Belarusian-Latin/BGN Gujarati-Devanagari Kannada-Latin Malayalam-Tamil Telugu-Devanagari
Any-Hex/Java Bengali-Devanagari Gujarati-Gurmukhi Kannada-Malayalam Malayalam-Telugu Telugu-Gujarati
Any-Hex/Perl Bengali-Gujarati Gujarati-Kannada Kannada-Oriya Maldivian-Latin/BGN Telugu-Gurmukhi
Any-Hex/Plain Bengali-Gurmukhi Gujarati-Latin Kannada-Tamil Mongolian-Latin/BGN Telugu-Kannada
Any-Hex/Unicode Bengali-Kannada Gujarati-Malayalam Kannada-Telugu Name-Any Telugu-Latin
Any-Hex/XML Bengali-Latin Gujarati-Oriya Katakana-Hiragana NumericPinyin-Latin Telugu-Malayalam
Any-Hex/XML10 Bengali-Malayalam Gujarati-Tamil Katakana-Latin NumericPinyin-Pinyin Telugu-Oriya
Any-Hiragana Bengali-Oriya Gujarati-Telugu Kazakh-Latin/BGN Oriya-Bengali Telugu-Tamil
Any-ja Bengali-Tamil Gurmukhi-Bengali Kirghiz-Latin/BGN Oriya-Devanagari Thaana-Latin
Any-Kannada Bengali-Telugu Gurmukhi-Devanagari Korean-Latin/BGN Oriya-Gujarati Thai-Latin
Any-Katakana Bopomofo-Latin Gurmukhi-Gujarati Latin-Arabic Oriya-Gurmukhi Tone-Digit
Any-ko Bulgarian-Latin/BGN Gurmukhi-Kannada Latin-Armenian Oriya-Kannada Traditional-Simplified
Any-Latin (default) cs_FONIPA-ja Gurmukhi-Latin Latin-ASCII Oriya-Latin Turkmen-Latin/BGN
Any-Latin/BGN cs_FONIPA-ko Gurmukhi-Malayalam Latin-Bengali Oriya-Malayalam Ukrainian-Latin/BGN
Any-Latin/Names cs-cs_FONIPA Gurmukhi-Oriya Latin-Bopomofo Oriya-Tamil Uzbek-Latin/BGN
Any-Latin/UNGEGN cs-ja Gurmukhi-Tamil Latin-Cyrillic Oriya-Telugu XSampa-IPA
Any-Lower cs-ko Gurmukhi-Telugu Latin-Devanagari Pashto-Latin/BGN zh_Latn_PINYIN-ru
Any-Malayalam Cyrillic-Latin Halfwidth-Fullwidth Latin-Georgian Persian-Latin/BGN
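
The IDs in the list above follow a source-target pattern, sometimes with a suffix after a slash. The sketch below splits an ID into those parts and falls back to Any-Latin when no ID is given. The names `TransliterationId` and `parse_transliteration_id` are hypothetical, and reading the suffix after "/" as a variant qualifier is an assumption based on how the IDs are written (for example Greek-Latin/UNGEGN).

```python
from typing import NamedTuple, Optional


class TransliterationId(NamedTuple):
    source: str
    target: str
    variant: Optional[str]


def parse_transliteration_id(identifier: Optional[str] = None) -> TransliterationId:
    """Split an ID such as 'Greek-Latin/UNGEGN' into its parts."""
    # Documented default when no ID is provided.
    identifier = identifier or "Any-Latin"
    pair, _, variant = identifier.partition("/")
    source, separator, target = pair.partition("-")
    if not separator or not source or not target:
        raise ValueError(f"Not a source-target ID: {identifier}")
    return TransliterationId(source, target, variant or None)


print(parse_transliteration_id("Greek-Latin/UNGEGN"))
print(parse_transliteration_id())
```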