
How to change a Hunspell affix file to allow numbers in words?


OCR programs often confuse the capital letter O with the digit 0, and the lowercase letter l with the digit 1. For example, they might recognize "Over" as "0ver" or "well" as "we11".

I tried to add

REP 0 O
REP 1 l

to the affix file, but it didn't work because numbers are apparently considered word boundaries.

(I had a look at the hunspell man page, but I can't figure out which of the numerous settings needs to be changed to allow numbers in words.)


Solution

  • From the Hunspell man page:

    REP what replacement

    This table specifies modifications to try first. First REP is the header of this table and one or more REP data line are following it. With this table, Hunspell can suggest the right forms for the typical spelling mistakes when the incorrect form differs by more than 1 letter from the right form. The search string supports the regex boundary signs (^ and $). For example a possible English replacement table definition to handle misspelled consonants:

              REP 5
              REP f ph
              REP ph f
              REP tion$ shun
              REP ^cooccurr co-occurr
              REP ^alot$ a_lot
    

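    Did you add the header line first, that is, REP followed by the number of replacement entries? The snippet in the question shows two entries but no REP 2 line before them.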
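
    Applied to the question, the replacement table might look like the sketch below. It assumes the missing header was the only problem; the thread does not confirm whether digits are additionally treated as word boundaries, so this may not be sufficient on its own.

    ```
    # Header: the number of REP data lines that follow
    REP 2
    # Suggest the capital letter O where the digit 0 was recognized
    REP 0 O
    # Suggest the lowercase letter l where the digit 1 was recognized
    REP 1 l
    ```

    If more replacement pairs are added later, the count on the header line has to be updated to match.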