Good day.
I'm trying to use Hunspell as a stemmer in my application. I don't much like Porter and Snowball stemming because they produce "chopped" results such as "abus" and "exampl". Lemmatizing seems like a good alternative, but I don't know of any good alternatives to CoreNLP, and I'm certainly not ready to port my project's source code to Java or use bridges yet. Ideally I would like to get the base, dictionary form of a given word.
As I've noticed, most dictionaries have separate entries in the .dic file for bid and bidding, set and setting, get and getting, etc. I'm not that experienced with Hunspell, but isn't there a clever way to handle the doubled d or t for three-letter words? Is there a way to make it understand that "setting" is actually derived from "set"?
My particular problem with Hunspell right now is that I can't find good, comprehensive documentation for creating or editing an affix file. This is what the documentation says here: http://manpages.ubuntu.com/manpages/dapper/man4/hunspell.4.html
(4) condition.
Zero stripping or affix are indicated by zero. Zero condition is
indicated by dot. Condition is a simplified, regular
expression-like pattern, which must be met before the affix can
be applied. (Dot signs an arbitrary character. Characters in
braces sign an arbitrary character from the character subset.
Dash hasn’t got special meaning, but circumflex (^) next the
first brace sets the complementer character set.)
Default one is this:
SFX G Y 2
SFX G e ing e
SFX G 0 ing [^e]
I've tried this one:
SFX G Y 4
SFX G e ing e
SFX G 0 ing [^e]
SFX G 0 ting [bcdfghjklmnpqrstvwxz][aeiou]t
SFX G 0 ding [bcdfghjklmnpqrstvwxz][aeiou]d
but it will clearly also match asSET. Is there any way around that? I've tried a ^ symbol at the start of the condition, but it doesn't seem to work. What can I do to make it work?
Thanks in advance.
Why would it match asset? That's not a verb, and as such shouldn't have that suffix attached to it.
The problem is that languages aren't perfectly regular. The solution that we've used in the Asturian spell checker at SoftAstur is to keep lists of verbs that form certain suffixes one way or another, and have a script construct the .dic file based on those lists.
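A minimal sketch of what such a script could look like in Python. This is not the SoftAstur script; build_dic and its parameter names are made up for illustration, and the flags Gs and Gd are the ones introduced next. It relies only on the fact that a .dic file starts with an entry count followed by one word/flag entry per line.

```python
def build_dic(single, doubled, plain=()):
    """Return the text of a .dic file built from hand-maintained word lists.

    single  -> verbs that take -ing without doubling (flag Gs)
    doubled -> verbs that double the final consonant (flag Gd)
    plain   -> entries stored as-is, with no flag (irregular forms)
    """
    entries = {}
    for word in plain:
        entries[word] = ""
    for word in single:
        entries[word] = "Gs"
    for word in doubled:
        entries[word] = "Gd"

    lines = [f"{word}/{flag}" if flag else word
             for word, flag in sorted(entries.items())]
    # The first line of a .dic file is the (approximate) number of entries.
    return "\n".join([str(len(lines))] + lines) + "\n"


print(build_dic(["bake", "visit"], ["admit", "stop"], ["singe", "singeing"]))
```

Keeping the lists as the source of truth means a verb only ever gets the doubling flag because someone put it in the "doubled" list, not because its spelling happens to fit a pattern.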
So for English, you'd define two separate affixes [1]:
SFX Gs Y 3
SFX Gs e ing [^eoy]e
SFX Gs 0 ing [eoy]e
SFX Gs 0 ing [^e]
SFX Gd Y 9
SFX Gd 0 bing [^aeiou][aeiou]b
SFX Gd 0 king [^aeiou][aeiou]c
SFX Gd 0 ding [^aeiou][aeiou]d
SFX Gd 0 ling [^aeiou][aeiou]l # for British English
SFX Gd 0 ming [^aeiou][aeiou]m
SFX Gd 0 ning [^aeiou][aeiou]n
SFX Gd 0 ping [^aeiou][aeiou]p
SFX Gd 0 ring [^aeiou][aeiou]r
SFX Gd 0 ting [^aeiou][aeiou]t
There are still other irregulars like singeing (to contrast with singing) that are uncommon enough that they are probably best coded as separate entries. So your dictionary file would then look more or less like the following:
admit/Gd --> admitting
bake/Gs --> baking
commit/Gd --> committing
free/Gs --> freeing
dye/Gs --> dyeing
inherit/Gs --> inheriting
picnic/Gd --> picnicking
target/Gs --> targeting
tiptoe/Gs --> tiptoeing
travel/Gs --> traveling (if American English)
travel/Gd --> travelling (if British English)
refer/Gd --> referring
sing/Gs --> singing
singe
singeing
sob/Gd --> sobbing
smile/Gs --> smiling
stop/Gd --> stopping
tap/Gd --> tapping
visit/Gs --> visiting
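To see how the rules and the dictionary flags interact, here is a rough Python simulation of suffix generation plus a reverse lookup from a generated form back to its dictionary entry. It is only a sketch: RULES, ENTRIES, apply_suffix and build_stem_table are names invented for this illustration, and treating the condition as a regular expression anchored at the end of the word is a simplification, not Hunspell's actual implementation.

```python
import re

# (strip, add, condition) triples mirroring the Gs and Gd classes; "0" = nothing.
RULES = {
    "Gs": [("e", "ing", "[^eoy]e"), ("0", "ing", "[eoy]e"), ("0", "ing", "[^e]")],
    # Double the final consonant; a final "c" takes "king" (picnic -> picnicking).
    "Gd": [("0", ("k" if c == "c" else c) + "ing", "[^aeiou][aeiou]" + c)
           for c in "bcdlmnprt"],
}

# A few sample dictionary entries as (word, flag) pairs.
ENTRIES = [("admit", "Gd"), ("bake", "Gs"), ("free", "Gs"), ("picnic", "Gd"),
           ("sing", "Gs"), ("stop", "Gd"), ("visit", "Gs")]


def apply_suffix(word, flag):
    """Return every form the given suffix class generates for word."""
    forms = []
    for strip, add, cond in RULES[flag]:
        strip = "" if strip == "0" else strip
        # The condition is checked against the END of the word only.
        if re.search(cond + "$", word) and word.endswith(strip):
            forms.append(word[: len(word) - len(strip)] + add)
    return forms


def build_stem_table(entries):
    """Map each dictionary word and each generated form to its base entry."""
    table = {}
    for word, flag in entries:
        table.setdefault(word, word)
        for form in apply_suffix(word, flag):
            table[form] = word
    return table


table = build_stem_table(ENTRIES)
print(table["admitting"])      # admit
print(table["baking"])         # bake
print(table.get("assetting"))  # None
```

The Gd condition on its own would accept "asset", since only the last three letters are checked. It never fires in practice because "asset" is not given the Gd flag in the dictionary.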
1. I prefer two-letter tags as they can be easier to read if you have a word with lots of tags, such that Gd = gerund doubled and Gs = gerund single or similar. Probably not a problem for English, but it definitely is for other languages. If you don't have a lot of affixes, you might just go with g (no doubling) and G (doubling).
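Two-character tags need a FLAG long line in the .aff file, so that a flag string is read two characters at a time instead of one. A small sketch of that splitting in Python follows. parse_entry is a hypothetical helper, the Pl tag is invented for the example, and morphological fields after the entry are ignored.

```python
def parse_entry(line, long_flags=True):
    """Split a .dic entry such as 'admit/GdPl' into (word, [flags])."""
    word, _, flags = line.strip().partition("/")
    step = 2 if long_flags else 1  # FLAG long -> two characters per flag
    return word, [flags[i:i + step] for i in range(0, len(flags), step)]


print(parse_entry("admit/GdPl"))                 # ('admit', ['Gd', 'Pl'])
print(parse_entry("admit/gP", long_flags=False)) # ('admit', ['g', 'P'])
print(parse_entry("singeing"))                   # ('singeing', [])
```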