Text Normalization and Transliteration
This plug-in applies normalization, transliteration and phonetic transformations to text strings.
Convergence Text Enricher - com.semarchy.engine.plugins.convergence.text
This enricher applies normalization, transliteration and phonetic transformations to text strings. It takes an Input Text and applies an Input Filter to this text, for example to remove all characters but letters. Then it applies a series of transformations defined in the Transformation parameter and returns a Transformed Text.
The following table lists the plug-in parameters.
Parameter Name | Mandatory | Type | Description |
---|---|---|---|
Input Filter | No | String | Filter applied to the input text before the transformation. Valid values are NONE, which applies no filter, and LETTERS, which removes all non-letter characters from the input string. |
Transformation | Yes | String | A pipe-separated sequence of transformation definitions. Transformations include NORMALIZE, TRANSLITERATE [<Id>] and PHONETIC <Type> [<MaxCodeLength>]. See the Transformation Definitions section for a detailed description of each transformation. |
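The effect of the two Input Filter values can be illustrated with a minimal Python sketch. This is not the plug-in's implementation: the function name `apply_input_filter` is hypothetical, the default of NONE when no filter is given is an assumption, and "letter" is assumed to mean any Unicode letter.

```python
def apply_input_filter(text: str, input_filter: str = "NONE") -> str:
    """Hypothetical helper mimicking the documented NONE and LETTERS filters."""
    name = input_filter.strip().upper()
    if name == "NONE":
        # NONE applies no filter: the text is returned unchanged.
        return text
    if name == "LETTERS":
        # Assumption: a "letter" is any character for which str.isalpha() is true.
        return "".join(ch for ch in text if ch.isalpha())
    raise ValueError(f"Unsupported input filter: {input_filter!r}")


print(apply_input_filter("O'Neil-Smith 3rd", "LETTERS"))  # ONeilSmithrd
```

Note that LETTERS also removes spaces and digits, so word boundaries are lost before the transformations run.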
The following table lists the plug-in inputs.
Parameter Name | Mandatory | Type | Description |
---|---|---|---|
Input Text | Yes | String | Text to transform. |
The following table lists the plug-in outputs.
Parameter Name | Mandatory | Type | Description |
---|---|---|---|
Transformed Text | Yes | String | Filtered and transformed text. |
The following transformation definitions are supported by the enricher:
NORMALIZE
: Performs a Normalization.
PHONETIC [METAPHONE [<max_code_length>] | DOUBLEMETAPHONE [<max_code_length>] | CAVERPHONE | SOUNDEX | REFINEDSOUNDEX]
: Applies a Phonetic Transformation.
TRANSLITERATE [<ID>]
: Applies a Transliteration to the string. The transliteration is identified by an ID. If no ID is provided, the Any-Latin transliteration is used.
It is possible to sequence transformations. Successive transformations are separated by a pipe “|” sign.
Examples of transformations:
NORMALIZE | PHONETIC SOUNDEX
NORMALIZE | TRANSLITERATE Any-Latin
NORMALIZE | TRANSLITERATE Any-Latin | PHONETIC METAPHONE 5
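To make the sequencing syntax concrete, the following Python sketch splits a Transformation value into ordered steps. It only illustrates the pipe-separated format described above. The names `Step` and `parse_transformations` are hypothetical, and the plug-in's own parsing rules (case sensitivity, whitespace handling, validation) are not documented here and are assumed.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    name: str        # transformation keyword, e.g. "NORMALIZE"
    args: List[str]  # remaining tokens, e.g. ["METAPHONE", "5"]


def parse_transformations(definition: str) -> List[Step]:
    """Split a pipe-separated definition into (name, args) steps, in order."""
    steps = []
    for chunk in definition.split("|"):
        tokens = chunk.split()
        if not tokens:
            raise ValueError("Empty transformation in sequence")
        # Assumption: keywords are matched case-insensitively.
        steps.append(Step(tokens[0].upper(), tokens[1:]))
    return steps


for step in parse_transformations(
    "NORMALIZE | TRANSLITERATE Any-Latin | PHONETIC METAPHONE 5"
):
    print(step.name, step.args)
```

Each step would then be applied to the output of the previous one, from left to right.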
The NORMALIZE transformation normalizes the string by applying a series of foldings, including accent removal and case folding. It implements the UTR #30 Character Foldings, which map similar characters to a common target so that certain distinctions between them are ignored.
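As a rough illustration of accent removal and case folding, the sketch below uses Python's standard `unicodedata` module. It approximates only two of the foldings and is not the plug-in's UTR #30 implementation. The function name `normalize` is hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Approximate accent removal and case folding (a subset of the foldings)."""
    # Decompose characters so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base characters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Fold case so that "A" and "a" compare equal.
    return without_marks.casefold()


print(normalize("Crème Brûlée"))  # creme brulee
```

Two strings that differ only by accents or case then produce the same normalized value, which is what makes the result useful for matching.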
A phonetic transformation converts the string to a code that corresponds to its pronunciation. The default phonetic transformation is METAPHONE.
Phonetic transformations include METAPHONE, DOUBLEMETAPHONE, CAVERPHONE, SOUNDEX and REFINEDSOUNDEX.
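To show what a phonetic code looks like, here is a minimal Python sketch of the classic American Soundex algorithm. It is given for illustration only: the plug-in's SOUNDEX transformation may differ in details, and the function name `soundex` and its `length` parameter are hypothetical.

```python
# Map each consonant to its Soundex digit; vowels, H, W and Y have no digit.
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(word: str, length: int = 4) -> str:
    """Classic American Soundex: first letter followed by up to three digits."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous coded letter.
        if digit and digit != previous:
            code += digit
        # H and W do not separate letters that share the same digit.
        if ch not in "HW":
            previous = digit
    # Pad with zeros or truncate to the requested length.
    return (code + "0" * length)[:length]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Names that sound alike, such as Robert and Rupert, receive the same code, so comparing codes tolerates spelling variations.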
Transliteration transforms a text from one character script to another, for example Traditional to Simplified Chinese, Japanese Hiragana to Katakana, or Cyrillic to Latin script.
Each source/target transliteration is identified by an ID. The supported transliteration IDs are listed below. If no ID is provided, the Any-Latin transliteration is used.
Each ID represents a transliteration from one script or language to another, for example Katakana-Latin or Latin-Thai. The special tag Any stands for any script or language. For example, Any-Latin converts any input script to Latin script.
ASCII-Latin | Gurmukhi-Gujarati | Latin-Jamo | Tamil-Telugu | Any-Remove | Any-Hans |
Accents-Any | Gurmukhi-Kannada | Latin-Kannada | Telugu-Bengali | Any-Hex/Unicode | Any-Hiragana |
Amharic-Latin/BGN | Gurmukhi-Latin | Latin-Katakana | Telugu-Devanagari | Any-Hex/Java | Any-ro_FONIPA |
Any-Accents | Gurmukhi-Malayalam | Latin-Malayalam | Telugu-Gujarati | Any-Hex/C | Any-Gujarati |
Any-Publishing | Gurmukhi-Oriya | Latin-NumericPinyin | Telugu-Gurmukhi | Any-Hex/XML | Any-Latin/UNGEGN |
Arabic-Latin | Gurmukhi-Tamil | Latin-Oriya | Telugu-Kannada | Any-Hex/XML10 | Any-Hangul |
Arabic-Latin/BGN | Gurmukhi-Telugu | Latin-Syriac | Telugu-Latin | Any-Hex/Perl | Any-Han |
Armenian-Latin | Halfwidth-Fullwidth | Latin-Tamil | Telugu-Malayalam | Any-Hex/Plain | Any-Arabic |
Armenian-Latin/BGN | Han-Latin | Latin-Telugu | Telugu-Oriya | Any-Hex | Any-Syriac |
Azerbaijani-Latin/BGN | Han-Latin/Names | Latin-Thaana | Telugu-Tamil | Hex-Any/Unicode | Any-Hebrew |
Belarusian-Latin/BGN | Hangul-Latin | Latin-Thai | Thaana-Latin | Hex-Any/Java | Any-Thai |
Bengali-Devanagari | Hans-Hant | Macedonian-Latin/BGN | Thai-Latin | Hex-Any/C | Any-Cyrillic |
Bengali-Gujarati | Hant-Hans | Malayalam-Bengali | Tone-Digit | Hex-Any/XML | Any-Georgian |
Bengali-Gurmukhi | Hebrew-Latin | Malayalam-Devanagari | Traditional-Simplified | Hex-Any/XML10 | Any-Armenian |
Bengali-Kannada | Hebrew-Latin/BGN | Malayalam-Gujarati | Turkmen-Latin/BGN | Hex-Any/Perl | Any-Greek |
Bengali-Latin | Hiragana-Katakana | Malayalam-Gurmukhi | Ukrainian-Latin/BGN | Hex-Any | Any-Greek/UNGEGN |
Bengali-Malayalam | Hiragana-Latin | Malayalam-Kannada | Uzbek-Latin/BGN | Any-Lower | Any-Bopomofo |
Bengali-Oriya | IPA-XSampa | Malayalam-Latin | XSampa-IPA | Any-Upper | Any-Thaana |
Bengali-Tamil | Jamo-Latin | Malayalam-Oriya | cs-cs_FONIPA | Any-Title | Any-es_FONIPA |
Bengali-Telugu | JapaneseKana-Latin/BGN | Malayalam-Tamil | cs-ja | Any-CaseFold | |
Bopomofo-Latin | Kannada-Bengali | Malayalam-Telugu | cs-ko | Any-Name | |
Bulgarian-Latin/BGN | Kannada-Devanagari | Maldivian-Latin/BGN | cs_FONIPA-ja | Name-Any | |
Cyrillic-Latin | Kannada-Gujarati | Mongolian-Latin/BGN | cs_FONIPA-ko | Any-NFC | |
Devanagari-Bengali | Kannada-Gurmukhi | NumericPinyin-Latin | es-am | Any-NFD | |
Devanagari-Gujarati | Kannada-Latin | NumericPinyin-Pinyin | es-es_FONIPA | Any-NFKC | |
Devanagari-Gurmukhi | Kannada-Malayalam | Oriya-Bengali | es-ja | Any-NFKD | |
Devanagari-Kannada | Kannada-Oriya | Oriya-Devanagari | es-zh | Any-FCD | |
Devanagari-Latin | Kannada-Tamil | Oriya-Gujarati | es_419-ja | Any-FCC | |
Devanagari-Malayalam | Kannada-Telugu | Oriya-Gurmukhi | es_419-zh | Any-Latin | |
Devanagari-Oriya | Katakana-Hiragana | Oriya-Kannada | es_FONIPA-am | Any-Latin/Names | |
Devanagari-Tamil | Katakana-Latin | Oriya-Latin | es_FONIPA-es_419_FONIPA | Any-Latin/BGN | |
Devanagari-Telugu | Kazakh-Latin/BGN | Oriya-Malayalam | es_FONIPA-ja | Any-zh | |
Digit-Tone | Kirghiz-Latin/BGN | Oriya-Tamil | es_FONIPA-zh | Any-am | |
Fullwidth-Halfwidth | Korean-Latin/BGN | Oriya-Telugu | it-am | Any-es_419_FONIPA | |
Georgian-Latin | Latin-ASCII | Pashto-Latin/BGN | it-ja | Any-ja | |
Georgian-Latin/BGN | Latin-Arabic | Persian-Latin/BGN | ja_Latn-ko | Any-Katakana | |
Greek-Latin | Latin-Armenian | Pinyin-NumericPinyin | ja_Latn-ru | Any-ru | |
Greek-Latin/BGN | Latin-Bengali | Publishing-Any | pl-ja | Any-sk_FONIPA | |
Greek-Latin/UNGEGN | Latin-Bopomofo | Russian-Latin/BGN | pl-pl_FONIPA | Any-cs_FONIPA | |
Gujarati-Bengali | Latin-Cyrillic | Serbian-Latin/BGN | pl_FONIPA-ja | Any-ko | |
Gujarati-Devanagari | Latin-Devanagari | Simplified-Traditional | ro-ja | Any-Telugu | |
Gujarati-Gurmukhi | Latin-Georgian | Syriac-Latin | ro-ro_FONIPA | Any-Oriya | |
Gujarati-Kannada | Latin-Greek | Tamil-Bengali | ro_FONIPA-ja | Any-Gurmukhi | |
Gujarati-Latin | Latin-Greek/UNGEGN | Tamil-Devanagari | ru-ja | Any-Devanagari | |
Gujarati-Malayalam | Latin-Gujarati | Tamil-Gujarati | ru-zh | Any-Malayalam | |
Gujarati-Oriya | Latin-Gurmukhi | Tamil-Gurmukhi | sk-ja | Any-Bengali | |
Gujarati-Tamil | Latin-Han | Tamil-Kannada | sk-sk_FONIPA | Any-Tamil | |
Gujarati-Telugu | Latin-Hangul | Tamil-Latin | sk_FONIPA-ja | Any-Kannada | |
Gurmukhi-Bengali | Latin-Hebrew | Tamil-Malayalam | zh_Latn_PINYIN-ru | Any-pl_FONIPA | |
Gurmukhi-Devanagari | Latin-Hiragana | Tamil-Oriya | Any-Null | Any-Hant |
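The IDs listed above share a Source-Target shape, optionally followed by a slash and a variant (for example BGN or UNGEGN). The Python sketch below splits an ID into these parts. It is based only on the shape of the IDs in this list. The names `TransliterationId` and `parse_transliteration_id` are hypothetical, and the plug-in is not known to expose such a function.

```python
from typing import NamedTuple, Optional


class TransliterationId(NamedTuple):
    source: str             # e.g. "Greek", or "Any" for any script
    target: str             # e.g. "Latin"
    variant: Optional[str]  # e.g. "UNGEGN", or None when absent


def parse_transliteration_id(identifier: str) -> TransliterationId:
    """Split 'Source-Target[/Variant]' into its components."""
    pair, _, variant = identifier.partition("/")
    # Split on the first hyphen only; locale-like parts use underscores.
    source, sep, target = pair.partition("-")
    if not sep or not source or not target:
        raise ValueError(f"Not a Source-Target ID: {identifier!r}")
    return TransliterationId(source, target, variant or None)


print(parse_transliteration_id("Greek-Latin/UNGEGN"))
print(parse_transliteration_id("Any-Latin"))
```

Reading an ID this way makes it easier to pick the right entry, for example choosing between Greek-Latin, Greek-Latin/BGN and Greek-Latin/UNGEGN.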