
Some questions about the usage of the Hunspell data format in the Hungarian Hunspell dictionary?


After reading through the Hunspell docs, I started looking at the seemingly most advanced instance of a set of Hunspell dictionary files, and it seems the Hungarian one (Hun-garian Spell) is the most robust.

I have a few questions that seem to be unanswered by the 17-page PDF docs (which appear to be the only real resource on Hunspell, other than the source code).

1. The meaning of the decimal numbers?

For example, the number 1547. We see it here:

AF @ # 1547

And it is used in PFX but not SFX:

PFX r 0 legújra/1547 . 24583
PFX r 0 legújjá/1547 . 24584
PFX r 0 legössze/1547 . 24585
PFX r 0 legát/1547 . 24586
PFX r 0 legáltal/1547 . 24587
PFX r 0 legvégig/1547 . 24588
PFX r 0 legvégbe/1547 . 24589
...

The thing after the slash is a flag as far as I learned, but where is that flag defined? The line AF @ # 1547 has 1547 as a comment, so not sure. Looking further at AF it appears the first line of AF 1548 means there are 1548 AF values that follow, and AF @ is the second to last one in the list, so maybe that's it?!

So then what does the @ symbol mean with regard to AF, which is described as:

Hunspell can substitute affix flag sets with ordinal numbers in affix rules (alias compression, see makealias tool).

I'm not following....

2. The meaning of the last decimal numbers on PFX?

Like we have from above:

PFX r 0 legát/1547 . 24586

That is the only place 24586 appears in the .aff file. So what does it mean? Same for all the numbers in that position. Line #24586 in the .dic file doesn't seem related either:

lódenkabát/39   1

3. What does the /number mean in the .dic file?

Regarding that last example:

lódenkabát/39   1

What do the /39 and the 1 mean? Where are those defined? I would have expected to find a PFX 39 or SFX 39 defined in the .aff file, but I don't see one.


Solution

  • Learned more by looking at the tests around alias2.aff (and other alias2 files):

    Files

    alias2.aff:

    AF 2
    AF AB
    AF A
    
    AM 3
    AM is:affix_x
    AM ds:affix_y
    AM po:noun xx:other_data
    
    SFX A Y 1
    SFX A 0 x . 1
    
    SFX B Y 1
    SFX B 0 y/2 . 2
    

    alias2.dic:

    1
    foo/1   3
    

    alias2.good:

    foo
    foox
    fooy
    fooyx
    

    alias2.morph:

    > foo
    analyze(foo) =  st:foo po:noun xx:other_data
    stem(foo) = foo
    > foox
    analyze(foox) =  st:foo po:noun xx:other_data is:affix_x
    stem(foox) = foo
    > fooy
    analyze(fooy) =  st:foo po:noun xx:other_data ds:affix_y
    stem(fooy) = fooy
    > fooyx
    analyze(fooyx) =  st:foo po:noun xx:other_data ds:affix_y is:affix_x
    stem(fooyx) = fooy
    

    Explanation

    Explaining the AM

    Stands for "morphological alias"?

    So the numbers are 1-based positions in the AM and AF lists, counted from the first entry after the line that gives the count! That is crazy to me, so brittle. But anyway...

    SFX A 0 x . 1
    

    That 1 is referring to AM morphological_fields (from the docs). So it is marking this suffix as AM 1 which is the first AM: is:affix_x. That corresponds to our alias2.morph file, where it shows:

    > foox
    analyze(foox) =  st:foo po:noun xx:other_data is:affix_x
    stem(foox) = foo
    

    Notice the is:affix_x.

    Now, foox has more. This is because in the .dic file, it says:

    foo/1   3
    

    That 3 is pointing to another AM, which is the last one.

    po:noun xx:other_data
    

    So AM 3 from the stem plus AM 1 from the suffix gives us the three morphological fields shown for foox in alias2.morph (after st:foo):

    po:noun xx:other_data is:affix_x
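
To make the AM lookup concrete, here is a minimal Python sketch of how the trailing numbers could be resolved. It is only an illustration of the 1-based indexing described above, not Hunspell's actual implementation; the names `load_morph_aliases` and `resolve_morph` are made up for this example.

```python
ALIAS2_AFF = """\
AM 3
AM is:affix_x
AM ds:affix_y
AM po:noun xx:other_data
"""

def load_morph_aliases(aff_text):
    """Return the AM entries in file order, skipping the leading count line."""
    rows = [line.split(None, 1)[1].strip()
            for line in aff_text.splitlines()
            if line.startswith("AM ")]
    return rows[1:]  # rows[0] is the count ("3")

def resolve_morph(alias_number, aliases):
    """Look up a 1-based AM alias number such as the trailing 1 or 3."""
    return aliases[int(alias_number) - 1]

aliases = load_morph_aliases(ALIAS2_AFF)
# "foo/1   3" -> the stem carries AM 3; "SFX A 0 x . 1" -> the suffix carries AM 1
foox_fields = resolve_morph("3", aliases) + " " + resolve_morph("1", aliases)
print(foox_fields)  # po:noun xx:other_data is:affix_x
```

The printed string is the part of the analyze(foox) line that follows st:foo.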
    

    Explaining the AF

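    Stands for "affix flag" alias: each AF line defines a set of flags, and a number after a slash refers to that line by its 1-based position.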

    The /1 here in the .dic references the AF position:

    foo/1
    

    And the /2 in the .aff does as well:

    SFX B 0 y/2 . 2
    

    So the y/2 is saying that suffix x can come after suffix y (as in fooyx), since 2 refers to the second AF entry, which is A, the flag of SFX A, which is the x suffix.
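
The AF lookup can be sketched the same way. This is a hedged illustration that assumes the default single-character flag type, so "AB" means the two flags A and B; `load_flag_aliases` and `expand_flags` are hypothetical helper names, not part of Hunspell.

```python
AFF_LINES = ["AF 2", "AF AB", "AF A"]

def load_flag_aliases(lines):
    """Return the AF entries in file order, skipping the leading count line."""
    rows = [line.split()[1] for line in lines if line.startswith("AF ")]
    return rows[1:]  # rows[0] is the count ("2")

def expand_flags(field, aliases):
    """Read the text after a slash as a 1-based AF alias number."""
    return aliases[int(field) - 1]

flag_aliases = load_flag_aliases(AFF_LINES)
print(expand_flags("1", flag_aliases))  # AB  (the flags on foo/1)
print(expand_flags("2", flag_aliases))  # A   (the flags on y/2)
```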

    I was a bit confused by foo/1, which is an alias for foo/AB: couldn't you just write foo/A and have it allow B because of the y/2 definition? No. foo/AB says that both SFX A and SFX B can attach directly to foo, giving foox and fooy. The /2 on the y suffix is a separate thing: it is what lets x follow y, giving fooyx. That must be it.
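
As a sanity check on that reading, here is a small Python sketch that expands foo/1 using the alias2 rules and compares the result with alias2.good. It is a simplification under stated assumptions: it ignores stripping, conditions, and any limit Hunspell places on how many suffixes may stack, and the `SUFFIXES` table and `expand` function are hand-written for this example rather than parsed from the file.

```python
AF = ["AB", "A"]  # AF 1 = flags A and B, AF 2 = flag A

# flag -> list of (appended text, continuation AF alias or "" for none)
SUFFIXES = {
    "A": [("x", "")],   # SFX A 0 x . 1
    "B": [("y", "2")],  # SFX B 0 y/2 . 2
}

def flags_for(alias):
    """Resolve a 1-based AF alias number to its flag string."""
    return AF[int(alias) - 1] if alias else ""

def expand(word, alias):
    """Return the word plus every form reachable through its flags."""
    forms = [word]
    for flag in flags_for(alias):
        for text, continuation in SUFFIXES.get(flag, []):
            # The new form carries only the flags the suffix itself declares.
            forms.extend(expand(word + text, continuation))
    return forms

print(expand("foo", "1"))  # ['foo', 'foox', 'fooy', 'fooyx']
```

Changing the entry to foo/2 (flag A only) in this sketch yields just foo and foox, which is why the B in AF 1 is needed.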