Text Normalization and Transliteration
This plugin applies normalization, transliteration and phonetic transformations to text strings.
Convergence Text Enricher - com.semarchy.engine.plugins.convergence.text
This enricher applies normalization, transliteration and phonetic transformations to text strings. It takes an Input Text and applies an Input Filter to this text, for example to remove all characters but letters. Then it applies a series of transformations defined in the Transformation parameter and returns a Transformed Text.
The following table lists the plug-in parameters.
Parameter Name | Mandatory | Type | Description |
---|---|---|---|
Input Filter | No | String | Filter applied to the input text before the transformation. Valid values are: NONE, which applies no filter; LETTERS, which removes all non-letter characters from the input string; and STANDARD, which tokenizes the input text by splitting words. |
Transformation | Yes | String | A pipe-separated sequence of transformation definitions. Transformations include: NORMALIZE, TRANSLITERATE [<Id>] and PHONETIC <Type> [<MaxCodeLength>]. See the Transformations section for a detailed description of each transformation. |
The following table lists the plug-in inputs.
Parameter Name | Mandatory | Type | Description |
---|---|---|---|
Input Text | Yes | String | Text to transform. |
The following table lists the plug-in outputs.
Parameter Name | Mandatory | Type | Description |
---|---|---|---|
Transformed Text | Yes | String | Filtered and transformed text. |
Secondary Transformed Text | Yes | String | Secondary transformed text. This output may contain the secondary result of a Beidermorse or Double Metaphone transformation. See Other Transformations for more information. |
The following input filters are supported by the enricher:
NONE: No filter is applied to the input text.
LETTERS: Removes all non-letter characters from the input text.
STANDARD: Breaks the input text into words according to the rules of the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
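As an illustration of the NONE and LETTERS filters described above, here is a minimal Python sketch. The helper name `apply_input_filter` is hypothetical and not part of the plug-in, and it assumes that "letter" means any character Unicode classifies as a letter. STANDARD is left out because the Python standard library does not implement UAX #29 word segmentation.

```python
def apply_input_filter(text: str, input_filter: str = "NONE") -> str:
    """Hypothetical helper that mimics the NONE and LETTERS input filters."""
    name = input_filter.strip().upper()
    if name == "NONE":
        # No filter: the text is passed through unchanged.
        return text
    if name == "LETTERS":
        # Assumption: a "letter" is any character for which str.isalpha() is true.
        return "".join(ch for ch in text if ch.isalpha())
    # STANDARD (UAX #29 segmentation) is not covered by this sketch.
    raise ValueError(f"Filter not covered by this sketch: {input_filter}")


print(apply_input_filter("O'Brien-Smith 3rd", "LETTERS"))  # OBrienSmithrd
```

Note that under this assumption spaces and digits are removed along with punctuation, since none of them are letters.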
The following transformation definitions are supported by the enricher:
NORMALIZE: Performs a Normalization.
PHONETIC [SOUNDEX | REFINEDSOUNDEX | METAPHONE [<max_code_length>] | DOUBLEMETAPHONE [<max_code_length>] | CAVERPHONE | CAVERPHONE1 | NYSIIS | MRA | COLOGNE | BEIDERMORSE]: Applies a Phonetic Transformation.
BEIDERMORSE [<split>] [<rule_type>] [<max_phonems>] [<name_type>] and DOUBLEMETAPHONE [<max_code_length>] [<split>]: Apply one of the Other Transformations, which can return a secondary transformed text.
TRANSLITERATE [<ID>]: Applies a Transliteration transformation to the string. The transliteration is identified by an ID. If no ID is provided, the Any-Latin transliteration is used.
It is possible to sequence transformations. Successive transformations are separated by a pipe “|” sign.
Examples of transformations:
NORMALIZE | PHONETIC SOUNDEX
NORMALIZE | TRANSLITERATE Any-Latin
NORMALIZE | TRANSLITERATE Any-Latin | PHONETIC METAPHONE 5
BEIDERMORSE APPROX 10 FALSE GENERIC
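To make the sequencing syntax concrete, the following Python sketch splits a Transformation value into ordered steps. It only illustrates the pipe-separated format described above. The function name `parse_transformation` is hypothetical, and the way the plug-in itself tokenizes and validates the string is not documented here.

```python
from typing import List, Tuple


def parse_transformation(spec: str) -> List[Tuple[str, List[str]]]:
    """Hypothetical parser: split a Transformation value into (name, arguments) steps."""
    steps = []
    for part in spec.split("|"):
        tokens = part.split()
        if not tokens:
            raise ValueError("Empty transformation in sequence")
        # Assumption: the first token names the transformation, the rest are its arguments.
        steps.append((tokens[0].upper(), tokens[1:]))
    return steps


for step in parse_transformation("NORMALIZE | TRANSLITERATE Any-Latin | PHONETIC METAPHONE 5"):
    print(step)
# ('NORMALIZE', [])
# ('TRANSLITERATE', ['Any-Latin'])
# ('PHONETIC', ['METAPHONE', '5'])
```

Each step would then be applied in order, with the output of one step becoming the input of the next.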
The NORMALIZE transformation normalizes the string by applying a series of transformations that map similar characters to a common target, so that certain distinctions between similar characters are ignored. This includes accent removal, case folding, etc.
The complete list of transformations is given below:
Accent removal | Hebrew Alternates folding | Overline folding | Suzhou Numeral folding |
Case folding | Jamo folding | Positional forms folding | Symbol folding |
Canonical duplicates folding | Letterforms folding | Small forms folding | Underline folding |
Dashes folding | Math symbol folding | Space folding | Vertical forms folding |
Diacritic removal (including stroke, hook, descender) | Multigraph Expansions: All | Spacing Accents folding | Width folding |
Greek letterforms folding | Native digit folding | Subscript folding | |
Han Radical folding | No-break folding | Superscript folding | |
For more information about these transformations, see UTR #30: Character Foldings.
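The sketch below approximates a small subset of these foldings (accent removal, case folding, width folding and no-break folding) with the Python standard library. It is an illustration of the idea only, not the plug-in's implementation: `normalize_sketch` is a hypothetical name, and foldings such as stroke or hook removal (for example "ø" or "ł") are not covered because those letters have no Unicode decomposition.

```python
import unicodedata


def normalize_sketch(text: str) -> str:
    """Hypothetical approximation of a few NORMALIZE foldings."""
    # Compatibility decomposition separates accents from base letters and
    # folds width variants and no-break spaces to their plain equivalents.
    decomposed = unicodedata.normalize("NFKD", text)
    # Accent removal: drop non-spacing combining marks (category Mn).
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Case folding: map to a caseless form for comparison.
    return stripped.casefold()


print(normalize_sketch("Éléonore MÜLLER"))  # eleonore muller
```

With this kind of folding, "Éléonore" and "ELEONORE" map to the same string, which is the property that matters for matching.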
A phonetic transformation converts the string into a code that represents its pronunciation. The default phonetic transformation is PHONETIC METAPHONE.
Phonetic transformations include:
PHONETIC SOUNDEX and PHONETIC REFINEDSOUNDEX: Phonetic algorithms for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. See Soundex for more information.
PHONETIC METAPHONE and PHONETIC DOUBLEMETAPHONE: Algorithms for indexing words by their English pronunciation. They are suitable for most English words, not just names. Double Metaphone can return both a primary and a secondary code for an input string, which accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. These algorithms support a Max Code Length parameter that defines the maximum length of the encoded result. This value defaults to 4. See Metaphone for more details.
PHONETIC CAVERPHONE and PHONETIC CAVERPHONE1: Algorithms for data matching for electoral rolls, optimized for accents present in parts of New Zealand. See Caverphone and Caverphone 1 for more details.
PHONETIC NYSIIS: New York State Identification and Intelligence System (NYSIIS) algorithm, which maps similar phonemes to the same letter. The result is a string that can be pronounced by the reader without decoding. See NYSIIS for more details.
PHONETIC MRA: Match Rating Approach developed by Western Airlines. This algorithm has an encoding and range comparison technique. See MRA for more details.
PHONETIC COLOGNE: Phonetic algorithm optimized for the German language. See Kölner Phonetik.
PHONETIC BEIDERMORSE: Phonetic algorithm supporting greater accuracy in matching Slavic and Yiddish surnames with similar pronunciation but differences in spelling. It returns a pipe-separated list of tokens: first the transformed input text, then the transformed synonyms of the input text. See Beidermorse for more information.
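To show how a phonetic transformation makes differently spelled names comparable, here is a minimal Python sketch of the classic American Soundex rules. It is written from the published algorithm, not from the plug-in's code, so the function name `soundex` and details such as the handling of non-ASCII letters are assumptions. The plug-in's PHONETIC SOUNDEX output may differ in edge cases.

```python
# Consonant groups of the classic American Soundex algorithm.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Sketch of American Soundex: first letter plus three digits."""
    # Assumption: only ASCII letters are considered; everything else is ignored.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W do not separate two letters that share a code.
            continue
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        # Vowels reset the previous code, so a repeated consonant is coded again.
        previous = digit
    # Pad with zeros and truncate to four characters.
    return (code + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Because "Robert" and "Rupert" produce the same code, a match rule that compares the encoded values treats them as candidates for the same name.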
These other transformations return a list of tokens which can be split into the Transformed Text and Secondary Transformed Text outputs.
Note: These transformations should preferably be used at the end of the transformation sequence, because their secondary transformed text is not processed by subsequent transformations in the sequence.
Other transformations include:
BEIDERMORSE [<split>] [<rule_type>] [<max_phonems>] [<name_type>]
The Beidermorse transformation returns a list of tokens: first the transformed input text, then the transformed synonyms of the input text. Beidermorse supports the following parameters:
split: If set to true, all synonyms after the first one are concatenated in the Secondary Transformed Text output. If set to false (default value), all synonyms are appended to the first token in the Transformed Text output.
rule_type: EXACT for exact or APPROX for approximate phonetic transformation.
max_phonems: Maximum number of phonemes.
name_type: The default value is GENERIC. Use ASHKENAZI or SEPHARDIC if you specifically want phonetic encodings optimized for Ashkenazi or Sephardic Jewish family names.
DOUBLEMETAPHONE [<max_code_length>] [<split>]
This transformation encodes the input string with the Double Metaphone algorithm and returns a primary code and a secondary code. If split is set to true, the secondary code is pushed to the Secondary Transformed Text output. Otherwise, it is concatenated to the primary code in the Transformed Text output.
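The effect of the split parameter on the two outputs can be sketched as follows. This is an illustration under stated assumptions: `route_tokens` is a hypothetical helper, the tokens are placeholders and not real phonetic codes, and the pipe separator is taken from the Beidermorse description. The separator that Double Metaphone uses when concatenating codes is not specified here.

```python
from typing import List, Tuple


def route_tokens(tokens: List[str], split: bool, separator: str = "|") -> Tuple[str, str]:
    """Hypothetical helper: return (Transformed Text, Secondary Transformed Text)."""
    if not tokens:
        return "", ""
    if split:
        # First token goes to the main output, the remaining ones to the secondary output.
        return tokens[0], separator.join(tokens[1:])
    # No split: everything is appended to the first token in the main output.
    return separator.join(tokens), ""


placeholder_tokens = ["primary", "alt1", "alt2"]  # placeholders, not real codes
print(route_tokens(placeholder_tokens, split=True))   # ('primary', 'alt1|alt2')
print(route_tokens(placeholder_tokens, split=False))  # ('primary|alt1|alt2', '')
```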
The TRANSLITERATE transformation converts text from one script to another. For example, Traditional to Simplified Chinese, Japanese Hiragana to Katakana, or Cyrillic to Latin script.
Each source/target transliteration is identified by an ID. The list of supported transliteration IDs is provided in the list below. If no ID is provided, the Any-Latin transliteration is used.
Each ID represents a transliteration from one script/language to another. For example: Katakana-Latin, Latin-Thai, etc. The special tag Any stands for any script/language. For example, Any-Latin converts any input script to Latin script.
Accents-Any | Any-Name | Devanagari-Bengali | Han-Latin | Latin-Greek | Pinyin-NumericPinyin |
Amharic-Latin/BGN | Any-NFC | Devanagari-Gujarati | Han-Latin/Names | Latin-Greek/UNGEGN | pl_FONIPA-ja |
Any-Accents | Any-NFD | Devanagari-Gurmukhi | Hangul-Latin | Latin-Gujarati | pl-ja |
Any-am | Any-NFKC | Devanagari-Kannada | Hans-Hant | Latin-Gurmukhi | pl-pl_FONIPA |
Any-Arabic | Any-NFKD | Devanagari-Latin | Hant-Hans | Latin-Han | Publishing-Any |
Any-Armenian | Any-Null | Devanagari-Malayalam | Hebrew-Latin | Latin-Hangul | ro_FONIPA-ja |
Any-Bengali | Any-Oriya | Devanagari-Oriya | Hebrew-Latin/BGN | Latin-Hebrew | ro-ja |
Any-Bopomofo | Any-pl_FONIPA | Devanagari-Tamil | Hex-Any | Latin-Hiragana | ro-ro_FONIPA |
Any-CaseFold | Any-Publishing | Devanagari-Telugu | Hex-Any/C | Latin-Jamo | ru-ja |
Any-cs_FONIPA | Any-Remove | Digit-Tone | Hex-Any/Java | Latin-Kannada | ru-zh |
Any-Cyrillic | Any-ro_FONIPA | es_419-ja | Hex-Any/Perl | Latin-Katakana | Russian-Latin/BGN |
Any-Devanagari | Any-ru | es_419-zh | Hex-Any/Unicode | Latin-Malayalam | Serbian-Latin/BGN |
Any-es_419_FONIPA | Any-sk_FONIPA | es_FONIPA-am | Hex-Any/XML | Latin-NumericPinyin | Simplified-Traditional |
Any-es_FONIPA | Any-Syriac | es_FONIPA-es_419_FONIPA | Hex-Any/XML10 | Latin-Oriya | sk_FONIPA-ja |
Any-FCC | Any-Tamil | es_FONIPA-ja | Hiragana-Katakana | Latin-Syriac | sk-ja |
Any-FCD | Any-Telugu | es_FONIPA-zh | Hiragana-Latin | Latin-Tamil | sk-sk_FONIPA |
Any-Georgian | Any-Thaana | es-am | IPA-XSampa | Latin-Telugu | Syriac-Latin |
Any-Greek | Any-Thai | es-es_FONIPA | it-am | Latin-Thaana | Tamil-Bengali |
Any-Greek/UNGEGN | Any-Title | es-ja | it-ja | Latin-Thai | Tamil-Devanagari |
Any-Gujarati | Any-Upper | es-zh | ja_Latn-ko | Macedonian-Latin/BGN | Tamil-Gujarati |
Any-Gurmukhi | Any-zh | Fullwidth-Halfwidth | ja_Latn-ru | Malayalam-Bengali | Tamil-Gurmukhi |
Any-Han | Arabic-Latin | Georgian-Latin | Jamo-Latin | Malayalam-Devanagari | Tamil-Kannada |
Any-Hangul | Arabic-Latin/BGN | Georgian-Latin/BGN | JapaneseKana-Latin/BGN | Malayalam-Gujarati | Tamil-Latin |
Any-Hans | Armenian-Latin | Greek-Latin | Kannada-Bengali | Malayalam-Gurmukhi | Tamil-Malayalam |
Any-Hant | Armenian-Latin/BGN | Greek-Latin/BGN | Kannada-Devanagari | Malayalam-Kannada | Tamil-Oriya |
Any-Hebrew | ASCII-Latin | Greek-Latin/UNGEGN | Kannada-Gujarati | Malayalam-Latin | Tamil-Telugu |
Any-Hex | Azerbaijani-Latin/BGN | Gujarati-Bengali | Kannada-Gurmukhi | Malayalam-Oriya | Telugu-Bengali |
Any-Hex/C | Belarusian-Latin/BGN | Gujarati-Devanagari | Kannada-Latin | Malayalam-Tamil | Telugu-Devanagari |
Any-Hex/Java | Bengali-Devanagari | Gujarati-Gurmukhi | Kannada-Malayalam | Malayalam-Telugu | Telugu-Gujarati |
Any-Hex/Perl | Bengali-Gujarati | Gujarati-Kannada | Kannada-Oriya | Maldivian-Latin/BGN | Telugu-Gurmukhi |
Any-Hex/Plain | Bengali-Gurmukhi | Gujarati-Latin | Kannada-Tamil | Mongolian-Latin/BGN | Telugu-Kannada |
Any-Hex/Unicode | Bengali-Kannada | Gujarati-Malayalam | Kannada-Telugu | Name-Any | Telugu-Latin |
Any-Hex/XML | Bengali-Latin | Gujarati-Oriya | Katakana-Hiragana | NumericPinyin-Latin | Telugu-Malayalam |
Any-Hex/XML10 | Bengali-Malayalam | Gujarati-Tamil | Katakana-Latin | NumericPinyin-Pinyin | Telugu-Oriya |
Any-Hiragana | Bengali-Oriya | Gujarati-Telugu | Kazakh-Latin/BGN | Oriya-Bengali | Telugu-Tamil |
Any-ja | Bengali-Tamil | Gurmukhi-Bengali | Kirghiz-Latin/BGN | Oriya-Devanagari | Thaana-Latin |
Any-Kannada | Bengali-Telugu | Gurmukhi-Devanagari | Korean-Latin/BGN | Oriya-Gujarati | Thai-Latin |
Any-Katakana | Bopomofo-Latin | Gurmukhi-Gujarati | Latin-Arabic | Oriya-Gurmukhi | Tone-Digit |
Any-ko | Bulgarian-Latin/BGN | Gurmukhi-Kannada | Latin-Armenian | Oriya-Kannada | Traditional-Simplified |
Any-Latin (default) | cs_FONIPA-ja | Gurmukhi-Latin | Latin-ASCII | Oriya-Latin | Turkmen-Latin/BGN |
Any-Latin/BGN | cs_FONIPA-ko | Gurmukhi-Malayalam | Latin-Bengali | Oriya-Malayalam | Ukrainian-Latin/BGN |
Any-Latin/Names | cs-cs_FONIPA | Gurmukhi-Oriya | Latin-Bopomofo | Oriya-Tamil | Uzbek-Latin/BGN |
Any-Latin/UNGEGN | cs-ja | Gurmukhi-Tamil | Latin-Cyrillic | Oriya-Telugu | XSampa-IPA |
Any-Lower | cs-ko | Gurmukhi-Telugu | Latin-Devanagari | Pashto-Latin/BGN | zh_Latn_PINYIN-ru |
Any-Malayalam | Cyrillic-Latin | Halfwidth-Fullwidth | Latin-Georgian | Persian-Latin/BGN |
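The structure of these IDs, and what a single transliteration does, can be sketched in Python. Both functions are hypothetical illustrations and not the plug-in's implementation. The parser assumes the Source-Target form described above and treats the part after "/" as an optional variant label, which is an assumption based on IDs such as Any-Latin/BGN. The Hiragana-Katakana function relies only on the fact that the two Unicode blocks are laid out in parallel, and it ignores any extra rules a full transliterator may apply.

```python
from typing import Tuple


def parse_transliteration_id(identifier: str = "Any-Latin") -> Tuple[str, str, str]:
    """Hypothetical parser: split an ID into (source, target, variant)."""
    # Assumption: anything after "/" is a variant label (for example BGN).
    base, _, variant = identifier.partition("/")
    source, _, target = base.partition("-")
    return source, target, variant


def hiragana_to_katakana(text: str) -> str:
    """Sketch of a Hiragana-Katakana transliteration by code point offset."""
    # Hiragana U+3041..U+3096 and Katakana U+30A1..U+30F6 are 0x60 apart.
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


print(parse_transliteration_id("Greek-Latin/UNGEGN"))  # ('Greek', 'Latin', 'UNGEGN')
print(hiragana_to_katakana("ひらがな"))  # ヒラガナ
```

Characters outside the Hiragana range, such as Latin letters or Kanji, are returned unchanged by this sketch.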