Text Normalization and Transliteration

This plugin applies normalization, transliteration and phonetic transformations to text strings.

Convergence Text Enricher

Plug-in ID

Convergence Text Enricher - com.semarchy.engine.plugins.convergence.text

Description

This enricher applies normalization, transliteration and phonetic transformations to text strings. It takes an Input Text and applies an Input Filter to this text, for example to remove all characters but letters. Then it applies a series of transformations defined in the Transformation parameter and returns a Transformed Text.
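The flow described above (filter first, then each transformation in order) can be sketched in Python. This is a minimal illustration, not the plug-in's implementation: the function names `apply_input_filter` and `enrich` are hypothetical, and treating "letter" as any character for which `str.isalpha()` is true is an assumption.

```python
from typing import Callable, Iterable


def apply_input_filter(text: str, input_filter: str = "NONE") -> str:
    """Apply the Input Filter: NONE keeps the text, LETTERS keeps only letters."""
    if input_filter == "NONE":
        return text
    if input_filter == "LETTERS":
        # Assumption: a "letter" is any character str.isalpha() accepts.
        return "".join(ch for ch in text if ch.isalpha())
    raise ValueError(f"Unsupported Input Filter: {input_filter!r}")


def enrich(
    input_text: str,
    input_filter: str,
    steps: Iterable[Callable[[str], str]],
) -> str:
    """Filter the input, then apply each transformation step in sequence."""
    result = apply_input_filter(input_text, input_filter)
    for step in steps:
        result = step(result)
    return result
```

For instance, `enrich("O'Neil-42", "LETTERS", [str.upper])` first reduces the input to its letters and then passes the result through the single step, here `str.upper` standing in for a real transformation.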

Plug-in Parameters

The following table lists the plug-in parameters.

Parameter Name Mandatory Type Description
Input Filter No String Filter applied to the input text before the transformation. Valid values are NONE, which applies no filter, and LETTERS, which removes all non-letter characters from the input string.
Transformation Yes String A pipe-separated sequence of transformation definitions. Transformations include: NORMALIZE, TRANSLITERATE [<Id>] and PHONETIC <Type> [<MaxCodeLength>]. See the Defining Transformations section for a detailed description of each transformation.

Plug-in Inputs

The following table lists the plug-in inputs.

Parameter Name Mandatory Type Description
Input Text Yes String Text to transform.

Plug-in Outputs

The following table lists the plug-in outputs.

Parameter Name Mandatory Type Description
Transformed Text Yes String Filtered and transformed text.

Defining Transformations

The following transformation definitions are supported by the enricher: NORMALIZE, TRANSLITERATE [<Id>] and PHONETIC <Type> [<MaxCodeLength>].

It is possible to sequence transformations. Successive transformations are separated by a pipe “|” sign.
Examples of transformations:
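A definition such as `NORMALIZE | TRANSLITERATE | PHONETIC METAPHONE 4` follows the syntax given for the Transformation parameter. The Python sketch below shows one way such a string could be split and validated. It is a hedged illustration only: `parse_transformations` is a hypothetical helper, tolerance of whitespace around the pipe is an assumption, and accepting a bare `PHONETIC` relies on METAPHONE being the stated default.

```python
from typing import List, Tuple


def parse_transformations(definition: str) -> List[Tuple[str, List[str]]]:
    """Split a pipe-separated definition into (name, arguments) pairs."""
    steps = []
    for part in definition.split("|"):
        tokens = part.split()
        if not tokens:
            raise ValueError("Empty transformation in sequence")
        name, args = tokens[0], tokens[1:]
        if name == "NORMALIZE":
            if args:
                raise ValueError("NORMALIZE takes no argument")
        elif name == "TRANSLITERATE":
            if len(args) > 1:
                raise ValueError("TRANSLITERATE takes at most one ID")
            # Any-Latin is the documented fallback when no ID is given.
            args = args or ["Any-Latin"]
        elif name == "PHONETIC":
            if len(args) > 2:
                raise ValueError("PHONETIC takes a type and an optional length")
            # Assumption: a bare PHONETIC falls back to the METAPHONE default.
            args = args or ["METAPHONE"]
            if len(args) == 2 and not args[1].isdigit():
                raise ValueError("MaxCodeLength must be a whole number")
        else:
            raise ValueError(f"Unknown transformation: {name!r}")
        steps.append((name, args))
    return steps
```

Keeping the result as an ordered list preserves the sequencing that the pipe sign expresses, so each step can be applied to the output of the previous one.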

Normalization

This transformation normalizes the string by applying a series of foldings, including accent removal and case folding. It implements the UTR#30 Character Foldings transformation, which maps similar characters to a common target so that certain distinctions between them are ignored.
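The effect of accent removal and case folding can be approximated with Python's standard `unicodedata` module. This sketch covers only a small subset of the UTR#30 foldings and is not the plug-in's implementation; the function name `fold` is hypothetical, and folding to lower case is an assumption, since the target case is not stated here.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate accent removal and case folding for a string."""
    # Decompose characters so that accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: distinctions of case are removed by folding to lower case.
    return stripped.casefold()
```

With this sketch, `fold("Café")` and `fold("CAFE")` produce the same string, which is the kind of equivalence the normalization is meant to provide.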

Phonetic Transformations

A phonetic transformation converts the string into a code that represents its pronunciation, so that words that sound alike produce the same or similar codes. The default phonetic transformation is METAPHONE.
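To illustrate the idea of a phonetic code and of a maximum code length, the sketch below implements the classic Soundex algorithm in Python. Soundex is used here only because it is compact; it is a different and simpler algorithm than METAPHONE, and nothing in this section states whether the plug-in supports it. The name `soundex` and its `max_code_length` parameter are hypothetical.

```python
# Map each consonant to its Soundex digit; vowels, H, W and Y have no digit.
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(word: str, max_code_length: int = 4) -> str:
    """Return the Soundex code of a word, padded or cut to max_code_length."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous coded letter.
        if digit and digit != previous:
            code += digit
        # H and W do not separate two letters that share the same digit.
        if ch not in "HW":
            previous = digit
    return (code + "0" * max_code_length)[:max_code_length]
```

Names that sound alike, such as "Robert" and "Rupert", receive the same code under this scheme, and a smaller `max_code_length` makes the match coarser.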

Phonetic transformations include:

Transliteration

Transliteration transforms a text from one character script to another. Examples include Traditional to Simplified Chinese, Japanese Hiragana to Katakana, and Cyrillic to Latin script.
Each source/target transliteration is identified by an ID. The supported transliteration IDs are listed below. If no ID is provided, the Any-Latin transliteration is used.

Each ID represents a transliteration from one script/language to another, for example Katakana-Latin or Latin-Thai. The special tag Any stands for any script/language. For example, Any-Latin converts any input script to Latin script.
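The structure of an ID can be made explicit with a small Python sketch that splits it into its source and target parts. This is an illustration under assumptions: `parse_transliteration_id` and `TransliterationId` are hypothetical names, and treating the optional part after `/` (as in Greek-Latin/UNGEGN) as a variant label is an assumption about how the IDs are composed.

```python
from typing import NamedTuple, Optional


class TransliterationId(NamedTuple):
    source: str
    target: str
    variant: Optional[str]


def parse_transliteration_id(identifier: str = "Any-Latin") -> TransliterationId:
    """Split an ID of the form Source-Target[/Variant] into its parts."""
    # Assumption: everything after the first "/" is a variant label.
    pair, _, variant = identifier.partition("/")
    source, separator, target = pair.partition("-")
    if not separator or not source or not target:
        raise ValueError(f"Not a Source-Target ID: {identifier!r}")
    return TransliterationId(source, target, variant or None)


def accepts_any_source(identifier: str) -> bool:
    """Tell whether the ID uses the special Any tag as its source."""
    return parse_transliteration_id(identifier).source == "Any"
```

Calling `parse_transliteration_id()` without an argument returns the parts of Any-Latin, mirroring the fallback used when no ID is provided.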

ASCII-Latin Gurmukhi-Gujarati Latin-Jamo Tamil-Telugu Any-Remove Any-Hans
Accents-Any Gurmukhi-Kannada Latin-Kannada Telugu-Bengali Any-Hex/Unicode Any-Hiragana
Amharic-Latin/BGN Gurmukhi-Latin Latin-Katakana Telugu-Devanagari Any-Hex/Java Any-ro_FONIPA
Any-Accents Gurmukhi-Malayalam Latin-Malayalam Telugu-Gujarati Any-Hex/C Any-Gujarati
Any-Publishing Gurmukhi-Oriya Latin-NumericPinyin Telugu-Gurmukhi Any-Hex/XML Any-Latin/UNGEGN
Arabic-Latin Gurmukhi-Tamil Latin-Oriya Telugu-Kannada Any-Hex/XML10 Any-Hangul
Arabic-Latin/BGN Gurmukhi-Telugu Latin-Syriac Telugu-Latin Any-Hex/Perl Any-Han
Armenian-Latin Halfwidth-Fullwidth Latin-Tamil Telugu-Malayalam Any-Hex/Plain Any-Arabic
Armenian-Latin/BGN Han-Latin Latin-Telugu Telugu-Oriya Any-Hex Any-Syriac
Azerbaijani-Latin/BGN Han-Latin/Names Latin-Thaana Telugu-Tamil Hex-Any/Unicode Any-Hebrew
Belarusian-Latin/BGN Hangul-Latin Latin-Thai Thaana-Latin Hex-Any/Java Any-Thai
Bengali-Devanagari Hans-Hant Macedonian-Latin/BGN Thai-Latin Hex-Any/C Any-Cyrillic
Bengali-Gujarati Hant-Hans Malayalam-Bengali Tone-Digit Hex-Any/XML Any-Georgian
Bengali-Gurmukhi Hebrew-Latin Malayalam-Devanagari Traditional-Simplified Hex-Any/XML10 Any-Armenian
Bengali-Kannada Hebrew-Latin/BGN Malayalam-Gujarati Turkmen-Latin/BGN Hex-Any/Perl Any-Greek
Bengali-Latin Hiragana-Katakana Malayalam-Gurmukhi Ukrainian-Latin/BGN Hex-Any Any-Greek/UNGEGN
Bengali-Malayalam Hiragana-Latin Malayalam-Kannada Uzbek-Latin/BGN Any-Lower Any-Bopomofo
Bengali-Oriya IPA-XSampa Malayalam-Latin XSampa-IPA Any-Upper Any-Thaana
Bengali-Tamil Jamo-Latin Malayalam-Oriya cs-cs_FONIPA Any-Title Any-es_FONIPA
Bengali-Telugu JapaneseKana-Latin/BGN Malayalam-Tamil cs-ja Any-CaseFold
Bopomofo-Latin Kannada-Bengali Malayalam-Telugu cs-ko Any-Name
Bulgarian-Latin/BGN Kannada-Devanagari Maldivian-Latin/BGN cs_FONIPA-ja Name-Any
Cyrillic-Latin Kannada-Gujarati Mongolian-Latin/BGN cs_FONIPA-ko Any-NFC
Devanagari-Bengali Kannada-Gurmukhi NumericPinyin-Latin es-am Any-NFD
Devanagari-Gujarati Kannada-Latin NumericPinyin-Pinyin es-es_FONIPA Any-NFKC
Devanagari-Gurmukhi Kannada-Malayalam Oriya-Bengali es-ja Any-NFKD
Devanagari-Kannada Kannada-Oriya Oriya-Devanagari es-zh Any-FCD
Devanagari-Latin Kannada-Tamil Oriya-Gujarati es_419-ja Any-FCC
Devanagari-Malayalam Kannada-Telugu Oriya-Gurmukhi es_419-zh Any-Latin
Devanagari-Oriya Katakana-Hiragana Oriya-Kannada es_FONIPA-am Any-Latin/Names
Devanagari-Tamil Katakana-Latin Oriya-Latin es_FONIPA-es_419_FONIPA Any-Latin/BGN
Devanagari-Telugu Kazakh-Latin/BGN Oriya-Malayalam es_FONIPA-ja Any-zh
Digit-Tone Kirghiz-Latin/BGN Oriya-Tamil es_FONIPA-zh Any-am
Fullwidth-Halfwidth Korean-Latin/BGN Oriya-Telugu it-am Any-es_419_FONIPA
Georgian-Latin Latin-ASCII Pashto-Latin/BGN it-ja Any-ja
Georgian-Latin/BGN Latin-Arabic Persian-Latin/BGN ja_Latn-ko Any-Katakana
Greek-Latin Latin-Armenian Pinyin-NumericPinyin ja_Latn-ru Any-ru
Greek-Latin/BGN Latin-Bengali Publishing-Any pl-ja Any-sk_FONIPA
Greek-Latin/UNGEGN Latin-Bopomofo Russian-Latin/BGN pl-pl_FONIPA Any-cs_FONIPA
Gujarati-Bengali Latin-Cyrillic Serbian-Latin/BGN pl_FONIPA-ja Any-ko
Gujarati-Devanagari Latin-Devanagari Simplified-Traditional ro-ja Any-Telugu
Gujarati-Gurmukhi Latin-Georgian Syriac-Latin ro-ro_FONIPA Any-Oriya
Gujarati-Kannada Latin-Greek Tamil-Bengali ro_FONIPA-ja Any-Gurmukhi
Gujarati-Latin Latin-Greek/UNGEGN Tamil-Devanagari ru-ja Any-Devanagari
Gujarati-Malayalam Latin-Gujarati Tamil-Gujarati ru-zh Any-Malayalam
Gujarati-Oriya Latin-Gurmukhi Tamil-Gurmukhi sk-ja Any-Bengali
Gujarati-Tamil Latin-Han Tamil-Kannada sk-sk_FONIPA Any-Tamil
Gujarati-Telugu Latin-Hangul Tamil-Latin sk_FONIPA-ja Any-Kannada
Gurmukhi-Bengali Latin-Hebrew Tamil-Malayalam zh_Latn_PINYIN-ru Any-pl_FONIPA
Gurmukhi-Devanagari Latin-Hiragana Tamil-Oriya Any-Null Any-Hant