nlp, stemming, hunspell

Hunspell affix condition regex format. Any way to match the start?


Good day.

I'm trying to use Hunspell as a stemmer in my application. I don't much like Porter and Snowball stemming because of their "chopped" results such as "abus" and "exampl". Lemmatizing seems like a good alternative, but I don't know of any good alternatives to CoreNLP, and I'm certainly not ready to port my project to Java or to use bridges yet. Ideally I would like to get the initial, as-in-the-dictionary form of a given word.

I've noticed that most dictionaries have separate entries in the .dic file for bid and bidding, set and setting, get and getting, etc. I'm not that experienced with Hunspell, but isn't there a clever way to handle the doubled d or t for 3-letter words? Is there a way to make it recognize that "setting" is actually derived from "set"?

My particular problem with Hunspell right now is that I can't find good, comprehensive documentation on creating and editing an affix file. This is what the documentation says here: http://manpages.ubuntu.com/manpages/dapper/man4/hunspell.4.html

(4) condition.

Zero stripping or affix are indicated by zero. Zero condition is
indicated by dot. Condition is a simplified, regular
expression-like pattern, which must be met before the affix can
be applied. (Dot signs an arbitrary character. Characters in
braces sign an arbitrary character from the character subset.
Dash hasn’t got special meaning, but circumflex (^) next the
first brace sets the complementer character set.)

The default rule is this:

SFX G Y 2
SFX G   e     ing        e
SFX G   0     ing        [^e] 

I've tried this one:

SFX G Y 4
SFX G   e     ing        e
SFX G   0     ing        [^e] 
SFX G   0     ting       [bcdfghjklmnpqrstvwxz][aeiou]t 
SFX G   0     ding       [bcdfghjklmnpqrstvwxz][aeiou]d 

but it will clearly also match "asset" (asSET). Is there any way around this? I've tried putting a ^ symbol at the start of the condition, but it doesn't seem to work. What can I do to make it work?

Thanks in advance.
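
The matching behaviour described in the quoted documentation can be emulated in a few lines of Python. This is only a sketch, not Hunspell's implementation: it assumes that a suffix condition is tested against the end of the word, and the helper names `condition_to_regex` and `apply_suffix_rules` are made up for illustration. It shows why the four-rule set above also fires on "asset": the condition only looks at the last three letters.

```python
import re


def condition_to_regex(condition: str) -> re.Pattern:
    """Translate a Hunspell-style suffix condition into a regex anchored at the word end."""
    parts = []
    i = 0
    while i < len(condition):
        ch = condition[i]
        if ch == "[":
            end = condition.index("]", i)
            body = condition[i + 1:end]
            negate = body.startswith("^")
            chars = re.escape(body[1:] if negate else body)  # dash stays literal
            parts.append("[" + ("^" if negate else "") + chars + "]")
            i = end + 1
        elif ch == ".":
            parts.append(".")  # dot stands for an arbitrary character
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts) + "$")


# (strip, add, condition) triples copied from the four-rule SFX G block above
RULES = [
    ("e", "ing", "e"),
    ("0", "ing", "[^e]"),
    ("0", "ting", "[bcdfghjklmnpqrstvwxz][aeiou]t"),
    ("0", "ding", "[bcdfghjklmnpqrstvwxz][aeiou]d"),
]


def apply_suffix_rules(word, rules):
    """Return every form produced by a rule whose condition matches the word."""
    forms = []
    for strip, add, condition in rules:
        strip_text = "" if strip == "0" else strip
        if word.endswith(strip_text) and condition_to_regex(condition).search(word):
            stem = word[: len(word) - len(strip_text)]
            forms.append(stem + add)
    return forms


print(apply_suffix_rules("set", RULES))
print(apply_suffix_rules("asset", RULES))
```

Under these assumptions both "set" and "asset" pick up the doubled form, and both also still match the plain `[^e]` rule, so the undoubled "seting" would be generated alongside "setting".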


Solution

  • Why would it match asset? That's not a verb, and as such shouldn't have that suffix attached to it.

    The problem is that languages aren't perfectly regular. The solution we've used in the Asturian spell checker at SoftAstur is to keep lists of the verbs that form certain suffixes one way or the other, and to have a script construct the .dic file from those lists.

    So for English, you'd define two separate affixes [1]:

    # two-character flags such as Gs and Gd need this declaration
    FLAG long

    SFX Gs Y 3
    SFX Gs e ing [^eoy]e
    SFX Gs 0 ing [eoy]e
    SFX Gs 0 ing [^e]
    
    # the l rule below is for British English
    SFX Gd Y 9
    SFX Gd 0 bing [^aeiou][aeiou]b
    SFX Gd 0 king [^aeiou][aeiou]c
    SFX Gd 0 ding [^aeiou][aeiou]d
    SFX Gd 0 ling [^aeiou][aeiou]l
    SFX Gd 0 ming [^aeiou][aeiou]m
    SFX Gd 0 ning [^aeiou][aeiou]n
    SFX Gd 0 ping [^aeiou][aeiou]p
    SFX Gd 0 ring [^aeiou][aeiou]r
    SFX Gd 0 ting [^aeiou][aeiou]t
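
Since the goal in the question is stemming, it may help to see the two affix classes read in reverse: strip the added suffix, restore the stripped part, and accept the candidate only if the dictionary lists it with the matching flag. The following Python sketch illustrates that idea; it is not Hunspell's algorithm. `LEXICON` is a hypothetical stand-in for the .dic file, `stems` is a made-up helper, and the conditions are passed straight to Python's `re` module, which happens to work for the simple bracket patterns used here.

```python
import re

# (flag, strip, add, condition) tuples mirroring the Gs and Gd rules above
RULES = [
    ("Gs", "e", "ing", "[^eoy]e"),
    ("Gs", "", "ing", "[eoy]e"),
    ("Gs", "", "ing", "[^e]"),
]
# final letter of the stem -> suffix added by the doubling class
DOUBLING = {"b": "bing", "c": "king", "d": "ding", "l": "ling", "m": "ming",
            "n": "ning", "p": "ping", "r": "ring", "t": "ting"}
RULES += [("Gd", "", suffix, "[^aeiou][aeiou]" + final)
          for final, suffix in DOUBLING.items()]

# Hypothetical mini dictionary: word -> set of flags (asset carries none)
LEXICON = {
    "set": {"Gd"},
    "asset": set(),
    "bake": {"Gs"},
    "picnic": {"Gd"},
    "dye": {"Gs"},
    "visit": {"Gs"},
}


def stems(surface):
    """Return dictionary words from which `surface` can be derived by one rule."""
    found = []
    for flag, strip, add, condition in RULES:
        if not surface.endswith(add):
            continue
        candidate = surface[: len(surface) - len(add)] + strip
        has_flag = flag in LEXICON.get(candidate, ())
        if has_flag and re.search("(?:" + condition + ")$", candidate):
            found.append(candidate)
    return found


print(stems("setting"))    # derived from "set" via Gd
print(stems("assetting"))  # nothing: "asset" does not carry Gd
```

The condition alone would still match "asset"; what blocks "assetting" in this sketch is that the entry has no Gd flag, which is the point made at the top of the answer.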
    

    There are still other irregulars, like singeing (in contrast with singing), that are uncommon enough that they are probably best listed as separate entries. So your dictionary file would then look more or less like the following:

    admit/Gd    --> admitting
    bake/Gs     --> baking
    commit/Gd   --> committing
    free/Gs     --> freeing
    dye/Gs      --> dyeing
    inherit/Gs  --> inheriting
    picnic/Gd   --> picnicking
    target/Gs   --> targeting
    tiptoe/Gs   --> tiptoeing
    travel/Gs   --> traveling  (if American English)
    travel/Gd   --> travelling (if British English)
    refer/Gd    --> referring
    sing/Gs     --> singing
    singe
    singeing
    sob/Gd      --> sobbing
    smile/Gs    --> smiling
    stop/Gd     --> stopping
    tap/Gd      --> tapping
    visit/Gs    --> visiting
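
The script that builds the .dic file from hand-maintained lists is not shown here, but a minimal Python sketch of the idea could look like this. The function name `build_dic` and the three input lists are assumptions made for illustration; the only format facts relied on are that a .dic file starts with an entry count and then lists one `word/FLAGS` entry per line.

```python
def build_dic(doubling, non_doubling, plain):
    """Build .dic text from verb lists.

    doubling     -> tagged Gd (final consonant is doubled before -ing)
    non_doubling -> tagged Gs (regular -ing)
    plain        -> listed without flags (irregular forms kept as-is)
    """
    entries = {}
    for word in non_doubling:
        entries[word] = "Gs"
    for word in doubling:
        entries[word] = "Gd"  # a word in both lists ends up as Gd
    for word in plain:
        entries.setdefault(word, "")

    lines = [f"{word}/{flag}" if flag else word
             for word, flag in sorted(entries.items())]
    # The first line of a .dic file is the (approximate) number of entries.
    return "\n".join([str(len(lines))] + lines) + "\n"


dic_text = build_dic(
    doubling=["admit", "stop"],
    non_doubling=["bake", "visit"],
    plain=["singe", "singeing"],
)
print(dic_text, end="")
```

Keeping the lists as the source of truth means a verb can be moved between the two classes (travel, for example) without editing the generated file by hand.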
    

    1. I prefer two-letter tags because they are easier to read when a word carries many tags, e.g. Gd = gerund doubled and Gs = gerund single. That's probably not a problem for English, but it definitely is for other languages. If you don't have a lot of affixes, you might just go with g (no doubling) and G (doubling).