After reading through the Hunspell docs, I started looking at what seems to be the most advanced set of Hunspell dictionary files, the Hungarian one (Hun-garian Spell), which appears to be the most robust.
I have a few questions that seem to be unanswered by the 17-page PDF docs (which appear to be the only real resource on Hunspell, other than the source code).
For example, the number 1547. We see it here:
AF @ # 1547
And it is used in PFX but not SFX:
PFX r 0 legújra/1547 . 24583
PFX r 0 legújjá/1547 . 24584
PFX r 0 legössze/1547 . 24585
PFX r 0 legát/1547 . 24586
PFX r 0 legáltal/1547 . 24587
PFX r 0 legvégig/1547 . 24588
PFX r 0 legvégbe/1547 . 24589
...
The thing after the slash is a flag, as far as I have learned, but where is that flag defined? The line AF @ # 1547 has 1547 only as a comment, so I'm not sure. Looking further at AF, it appears the first line, AF 1548, means there are 1548 AF values that follow, and AF @ is the second to last one in the list, so maybe that's it?!
So what does the @ symbol mean with regard to AF, which is described as:
Hunspell can substitute affix flag sets with ordinal numbers in affix rules (alias compression, see makealias tool).
I'm not following....
What does the number at the end of a PFX line mean? Like we have from above:
PFX r 0 legát/1547 . 24586
That is the only place 24586 appears in the .aff file. So what does it mean? Same for all the numbers in that position. Line #24586 in the .dic file doesn't seem related either:
lódenkabát/39 1
What does /number mean in the .dic file?
Regarding that last example:
lódenkabát/39 1
What do /39 and the 1 mean? Where are those defined? I would have assumed I'd find a PFX 39 or SFX 39 defined in the .aff file, but I don't see one.
Learned more by looking at the tests around alias2.aff (and other alias2 files):
alias2.aff:
AF 2
AF AB
AF A
AM 3
AM is:affix_x
AM ds:affix_y
AM po:noun xx:other_data
SFX A Y 1
SFX A 0 x . 1
SFX B Y 1
SFX B 0 y/2 . 2
alias2.dic:
1
foo/1 3
alias2.good:
foo
foox
fooy
fooyx
alias2.morph:
> foo
analyze(foo) = st:foo po:noun xx:other_data
stem(foo) = foo
> foox
analyze(foox) = st:foo po:noun xx:other_data is:affix_x
stem(foox) = foo
> fooy
analyze(fooy) = st:foo po:noun xx:other_data ds:affix_y
stem(fooy) = fooy
> fooyx
analyze(fooyx) = st:foo po:noun xx:other_data ds:affix_y is:affix_x
stem(fooyx) = fooy
AM
Stands for "morphological alias"?
So this is saying we are dealing with line numbers counted from where the AM and AF lists start (the count line itself is not counted)! That is crazy to me, so brittle. But anyway...
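To make the "position in the list" idea concrete, here is a minimal sketch in Python (not Hunspell's own code) that collects the AF and AM entries from the alias2.aff text above and looks them up by 1-based number. The function names parse_alias_tables and resolve are made up for illustration, and treating everything after the first token of an AF line as ignorable (such as the "# 1547" comment) is my assumption about how those lines are read.

```python
ALIAS2_AFF = """\
AF 2
AF AB
AF A
AM 3
AM is:affix_x
AM ds:affix_y
AM po:noun xx:other_data
SFX A Y 1
SFX A 0 x . 1
SFX B Y 1
SFX B 0 y/2 . 2
"""


def parse_alias_tables(aff_text):
    """Collect AF and AM entries in file order (hypothetical helper).

    The first AF/AM line only carries the entry count, so it is skipped.
    """
    tables = {"AF": [], "AM": []}
    seen_count = {"AF": False, "AM": False}
    for line in aff_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] not in tables:
            continue
        key, rest = parts
        if not seen_count[key]:
            seen_count[key] = True  # "AF 2" / "AM 3": just the count
            continue
        if key == "AF":
            # Assumption: only the first token is the flag set,
            # so a trailing "# 1547" style comment is dropped.
            tables[key].append(rest.split()[0])
        else:
            tables[key].append(rest.strip())
    return tables


def resolve(table, number):
    """Alias numbers are 1-based positions in the table."""
    return table[number - 1]


tables = parse_alias_tables(ALIAS2_AFF)
print(resolve(tables["AF"], 1))  # AB
print(resolve(tables["AF"], 2))  # A
print(resolve(tables["AM"], 3))  # po:noun xx:other_data
```

Under the same assumption, the Hungarian line AF @ # 1547 would simply be the 1547th entry of the AF list, with @ as its flag set.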
SFX A 0 x . 1
That 1 refers to an AM morphological_fields entry (from the docs). So it marks this suffix with AM 1, which is the first AM: is:affix_x. That corresponds to our alias2.morph file, where it shows:
> foox
analyze(foox) = st:foo po:noun xx:other_data is:affix_x
stem(foox) = foo
Notice the is:affix_x.
Now, foox has more. This is because in the .dic file, it says:
foo/1 3
That 3 points to another AM, which is the last one:
po:noun xx:other_data
So together with AM 1 from the suffix, that gives us the two AM entries shown for foox in alias2.morph:
po:noun xx:other_data is:affix_x
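As a sanity check on this reading, the analyze lines from alias2.morph can be rebuilt by joining the stem's AM entry with the AM entries of the suffixes that were applied. This is only a sketch that mimics the output shown above; the analyze function and its argument order are hypothetical, not Hunspell's API.

```python
# AM table copied from alias2.aff; alias numbers are 1-based.
AM = ["is:affix_x", "ds:affix_y", "po:noun xx:other_data"]


def analyze(stem, stem_am, suffix_ams=()):
    """Join the stem's morphological alias with those of the applied suffixes."""
    fields = ["st:" + stem, AM[stem_am - 1]]
    fields += [AM[n - 1] for n in suffix_ams]
    return " ".join(fields)


# foo/1 3 in the .dic gives the stem AM 3.
print(analyze("foo", 3))          # st:foo po:noun xx:other_data
# foox: SFX A 0 x . 1 adds AM 1.
print(analyze("foo", 3, [1]))     # st:foo po:noun xx:other_data is:affix_x
# fooyx: SFX B (AM 2) followed by SFX A (AM 1).
print(analyze("foo", 3, [2, 1]))  # st:foo po:noun xx:other_data ds:affix_y is:affix_x
```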
AF
Stands for "affix flag".
The /1 here in the .dic references the AF position:
foo/1
And the /2 in the .aff does as well:
SFX B 0 y/2 . 2
So the y/2 is saying that suffix x can come after suffix y, since 2 links to AF 2, which is AF A, which links to SFX A, which is the x suffix. That is how fooyx ends up in alias2.good.
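I was a bit confused at foo/1, which is an alias for foo/AB: couldn't you just write foo/A and have it know about B because of the y/2 definition? No, because y/2 only says what may follow the y suffix. foo/1, i.e. foo/AB, must be saying that both flag A and flag B apply directly to foo, which gives foox and fooy. With only foo/A, fooy would never be produced. The /2 on SFX B then lets fooy take the A suffix as well, which gives fooyx. That must be it.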
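One way to test a reading of foo/1 is to expand the entry by hand and compare against alias2.good. Below is a rough sketch, assuming single-character flags and ignoring conditions, stripping, and any limit Hunspell places on how many suffixes may be stacked. The SFX dictionary layout and the expand function are made up for illustration.

```python
# AF table from alias2.aff: alias 1 = flags "AB", alias 2 = flag "A".
AF = ["AB", "A"]

# flag -> list of (text to append, continuation AF alias or None)
SFX = {
    "A": [("x", None)],  # SFX A 0 x . 1
    "B": [("y", 2)],     # SFX B 0 y/2 . 2
}


def flags_for(alias):
    """Resolve a 1-based AF alias to its flag string."""
    return AF[alias - 1] if alias is not None else ""


def expand(word, alias):
    """List the word plus every form reachable through its flags."""
    forms = [word]
    for flag in flags_for(alias):
        for suffix, continuation in SFX.get(flag, []):
            forms.extend(expand(word + suffix, continuation))
    return forms


print(expand("foo", 1))  # ['foo', 'foox', 'fooy', 'fooyx']
print(expand("foo", 2))  # ['foo', 'foox']
```

The first call reproduces the four words in alias2.good. The second shows what the entry would yield if it were written with alias 2 (flag A only): fooy and fooyx disappear.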