solr nlp icu

ICUTransformFilter in SOLR


I get the output below after configuring ICUTransformFilter in Solr:

สวัสดี is converted to s̄wạs̄dī. I don't understand which script it was converted to. My schema configuration looks like this:

<analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
    <filter class="solr.ICUTransformFilterFactory" id="NFD; [:Nonspacing Mark:] Remove; NFC" />
    <filter class="solr.BeiderMorseFilterFactory" />
</analyzer>

It says Thai-Latin, but when I use Google Translate, it converts it to "slave".


Solution

  • This seems to be copied from my Thai example, where the sequence of filters is already explained. That configuration lets you search for something like 'sawadika' and find actual Thai text containing the original word that sounds like that greeting (the form used by female speakers).

    You seem to be confusing translation (Thai to English, as in Google Translate) with transliteration (mapping Thai to phonetically close Latin characters). Transliteration is what is happening here (Google actually shows it as well). In summary, after the first filter you still have tonal marks that indicate the rising, falling, and other tones of the Thai language. The second filter should remove them to give swasdi. The final filter then does some phonetic broadening to match alternative spellings.
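
To see what the second filter does, here is a minimal Python sketch of the same idea as the rule "NFD; [:Nonspacing Mark:] Remove; NFC", using only the standard unicodedata module. It is not how Solr or ICU implements the filter; the function name strip_marks is hypothetical, and the input is the transliterated string from the question written with escapes.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Approximate the ICU rule 'NFD; [:Nonspacing Mark:] Remove; NFC'.

    strip_marks is a hypothetical helper for illustration only.
    """
    # NFD: split precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character in general category Mn (nonspacing mark).
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # NFC: recompose whatever is left.
    return unicodedata.normalize("NFC", kept)


# Output of the Thai-Latin step as reported in the question: s̄wạs̄dī
transliterated = "s\u0304w\u1ea1s\u0304d\u012b"
print(strip_marks(transliterated))  # swasdi
```

The Thai-Latin step itself is not reproduced here because the Python standard library has no equivalent transliterator; only the mark-removal step is shown.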