How to indicate a word exception for stemming in Hunspell


I am using Hunspell to stem words for a Solr instance. For the most part, it seems to be working well.

I'm using the OpenOffice dic/aff files.

However, there are some notable exceptions, and I'd like to be able to exclude these words from stemming.

A great example is "skier", which stems to "sky" because of the following:

In the .dic file:
sky/MDRSGZ

Relevant rule in the .aff file:
SFX R   y     ier        [^aeiou]y
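To make the mechanism concrete, here is a minimal Python sketch of how a suffix rule of this shape derives "skier" from "sky". The function `apply_suffix_rule` is a hypothetical helper written for illustration, not part of Hunspell or Solr, and it only approximates the rule using a regular expression for the condition.

```python
import re


def apply_suffix_rule(word, strip, add, condition):
    """Apply one Hunspell-style SFX rule to `word` (illustrative only).

    If the end of `word` matches `condition`, remove `strip` from the end
    and append `add`. Returns None when the rule does not apply.
    """
    if not re.search(condition + "$", word):
        return None
    if strip and not word.endswith(strip):
        return None
    stem = word[: len(word) - len(strip)] if strip else word
    return stem + add


# The rule from the .aff file: strip "y", add "ier", condition "[^aeiou]y".
derived = apply_suffix_rule("sky", strip="y", add="ier", condition="[^aeiou]y")
print(derived)
```

Because "sky" carries the R flag and the rule produces "skier", the stemmer can run the rule in reverse and map "skier" back to "sky".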

Is there any way to indicate that skier and only skier should be left alone?


Solution

  • Yes, this is a very common problem. Just remove the "R" flag from the entry for "sky" in the .dic file:

    sky/MDSGZ
    

    You may then want to add "skier", and any other forms of it, back in as a separate entry:

    skier/MS
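If there are many such exceptions, the two manual edits above could be scripted. The following is only a sketch under some assumptions: the helpers `remove_flag` and `add_entry` are hypothetical names, flags are single characters, entries have no extra morphological fields, and the first line of the .dic file holds the entry count.

```python
def remove_flag(dic_lines, word, flag):
    """Return a copy of dic_lines with `flag` removed from the entry for `word`."""
    result = []
    for line in dic_lines:
        entry, sep, flags = line.partition("/")
        if sep and entry == word:
            # Assumes single-character flags, so a plain replace is enough.
            flags = flags.replace(flag, "")
            line = f"{entry}/{flags}" if flags else entry
        result.append(line)
    return result


def add_entry(dic_lines, entry):
    """Return a copy of dic_lines with `entry` appended and the count updated."""
    # Assumes the first line of the .dic file is the entry count.
    count = int(dic_lines[0]) + 1
    return [str(count)] + dic_lines[1:] + [entry]


# Hypothetical, heavily shortened dictionary contents.
lines = ["3", "ski/MDSG", "sky/MDRSGZ", "skyline/MS"]
lines = remove_flag(lines, "sky", "R")
lines = add_entry(lines, "skier/MS")
print("\n".join(lines))
```

Reading the real file into a list of lines and writing the result back is left out here; keep a backup of the original .dic file before trying anything like this.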
    

    I have had to make numerous changes to this file, and now really wish there was a better option. For example, these were all stemmed incorrectly:

    • Butter -> Butt
    • Corner -> Corn
    • Easter -> East

    And then there is another one that is really confusing:

    • Wind == Wound

    On my site, before we fixed it, a search for "wind" as in "wind power" returned a bunch of bruises and bloody wounds, because "wound" as in "I wound the clock" stems to "wind".

    We also decided to remove all "re" prefixes, because of things like:

    • remarkable -> mark
    • remove -> move
    • reset -> set
    • restore -> store
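One way to drop a prefix like this is to delete its rule group from the .aff file. The sketch below assumes the usual layout of a `PFX <flag> <cross-product> <count>` header line followed by `PFX <flag> <strip> <add> <condition>` rule lines, and that the flag is used only for this prefix. `drop_prefix_rules` is a hypothetical helper and the sample lines are illustrative, not copied from any particular dictionary.

```python
def drop_prefix_rules(aff_lines, prefix):
    """Return aff_lines without any PFX group that has a rule adding `prefix`."""
    # First pass: collect the flags whose rule lines add the given prefix.
    flags = set()
    for line in aff_lines:
        parts = line.split()
        # Rule lines have at least five fields; header lines have four.
        if len(parts) >= 5 and parts[0] == "PFX" and parts[3] == prefix:
            flags.add(parts[1])
    # Second pass: drop both header and rule lines for those flags.
    kept = []
    for line in aff_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "PFX" and parts[1] in flags:
            continue
        kept.append(line)
    return kept


# Hypothetical, abbreviated .aff contents.
aff = [
    "PFX A Y 1",
    "PFX A   0     re         .",
    "PFX I Y 1",
    "PFX I   0     in         .",
    "SFX R   y     ier        [^aeiou]y",
]
cleaned = drop_prefix_rules(aff, "re")
print("\n".join(cleaned))
```

Entries in the .dic file may still carry the removed flag afterwards. How a flag with no matching rule is handled is something to verify against your own setup, or the flag can be stripped from those entries as well.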

    So if you know of a dictionary that is better suited to this, please let me know. (I think the main problem is that this dictionary is intended more for spell checking than for stemming.)

    I would be willing to start and/or contribute to a Git project for a real stemming dictionary to replace this spelling dictionary for everyone out there using it.