I am using Hunspell to stem words for a SOLR instance. For the most part, it seems to be working well.
I'm using the OpenOffice dic/aff files.
However, there are some notable word exceptions, and I'd like to be able to remove these as candidates for stemming.
A great example is "skier", which stems to "sky" because of the following:
in the .dic file
sky/MDRSGZ
relevant rule in the .aff file
SFX R y ier [^aeiou]y
Is there any way to indicate that skier
and only skier
should be left alone?
Yeah this is a very common thing, just remove the "R"
sky/MDSGZ
But you may then want to add back in on another line "skier" and any other versions of it.
skier/MS
I have had to make numerous changes to this file, and now really wish there was a better option. For example
And then another one that is really confusing,
On my site before we fixed it if you searched for wind like in "wind power" you ended up with a bunch of bruises and bloody wounds. Because "wound" like in "I wound the clock" stemmed to wind.
We also decided to remove all RE prefixes. because things like
So if you know of a better dictionary that is better for this please let me know. (I think the main problem is this dictionary is more intended for spell check then for stemming)
I would be willing to start and/or contribute to a git project for a real stemming dictionary to replace this spelling dictionary for everyone out there using this.