character-encoding tokenize fast-esp

FAST ESP character normalization


I'm running a search application on a FAST ESP server, and I have a problem with character normalization.

What I want is to search for 'wurth' and get a hit in 'würth'.

I've tried configuring the following in esp/etc/tokenizer/tokenization.xml:

 <normalizationlist name="German to Norwegian">
   <normalization description="German u with diaeresis, to Norwegian u">
     <input>x75</input>
     <output>xFC</output>
     <output>x75</output>
   </normalization>
 </normalizationlist>

But of course, this translates every u to ü, which is useless.

How do I configure this the right way?


Solution

  • The solution is to normalize every "special character" to the same "normal character":

    ö -> o, ø -> o, å -> a, ä -> a, æ -> a

    This is a bit time-consuming, but it works!
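
As a rough, untested sketch of what that could look like in tokenization.xml: the fragment below reuses the element names from the question and simply puts the special character in <input> and the plain character in <output>, which is the reverse of the original attempt. The list name and the descriptions are made up for illustration, and the hex values are the Latin-1/Unicode code points, on the assumption that they are interpreted the same way as x75 and xFC in the question.

```xml
<!-- Hypothetical list name; only the element names are taken from the question. -->
<normalizationlist name="Special characters to plain characters">
  <!-- ü (xFC) to u (x75): makes a search for 'wurth' match 'würth' -->
  <normalization description="u with diaeresis to plain u">
    <input>xFC</input>
    <output>x75</output>
  </normalization>
  <!-- ö (xF6) to o (x6F) -->
  <normalization description="o with diaeresis to plain o">
    <input>xF6</input>
    <output>x6F</output>
  </normalization>
  <!-- ø (xF8) to o (x6F) -->
  <normalization description="o with stroke to plain o">
    <input>xF8</input>
    <output>x6F</output>
  </normalization>
  <!-- å (xE5) to a (x61) -->
  <normalization description="a with ring to plain a">
    <input>xE5</input>
    <output>x61</output>
  </normalization>
  <!-- ä (xE4) to a (x61) -->
  <normalization description="a with diaeresis to plain a">
    <input>xE4</input>
    <output>x61</output>
  </normalization>
  <!-- æ (xE6) to a (x61) -->
  <normalization description="ae ligature to plain a">
    <input>xE6</input>
    <output>x61</output>
  </normalization>
</normalizationlist>
```

Each entry has a single <output>, so both the indexed text and the query should end up with the same plain character. Uppercase variants (Ü, Ö and so on) are not covered here and would need their own entries if the text is not lowercased first.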