When you build your model file with crf_learn using the -t option: crf_learn template train_data -t model
It generates two model files, one of which is model.txt.
Can anybody tell me what the float numbers mean?
See the following example:
version: 100 cost-factor: 1 maxid: 40 xsize: 1
B I
U00:%x[0,0] B
36 B 20 U00:、 26 U00:か 18 U00:が 22 U00:こ 8 U00:た 10 U00:ち 2 U00:っ 4 U00:て 34 U00:に 12 U00:の 0 U00:よ 28 U00:ら 24 U00:れ 32 U00:上 14 U00:世 16 U00:代 30 U00:地 6 U00:私
-0.3022268562246992 0.3022268562246989 -0.3629407244093161 0.3629407244093156 -0.3327259487028221 0.3327259487028215 0.3462799099537973 -0.3462799099537980 0.3452020097664334 -0.3452020097664336 -0.3218750203631590 0.3218750203631575 0.0376944272290242 -0.0376944272290280 0.3329631783491211 -0.3329631783491230 -0.3092967308014029 0.3092967308014015 0.3413769126433928 -0.3413769126433950 0.3786782765859961 -0.3786782765859980 0.5208645073272351 -0.5208645073272384 -0.3261580548802839 0.3261580548802814 -0.3615756495615902 0.3615756495615884 -0.3248593224319323 0.3248593224319312 0.3281895709166696 -0.3281895709166719 -0.3040331359589971 0.3040331359589951 0.2836939567332580 -0.2836939567332600 -0.1530917919770705 -0.1613508585854637 0.4245699543724943 -0.1101273038099901
My understanding is that each float number should correspond to one entry in the feature list; for instance, the first float number "-0.3022268562246992" should correspond to "36 B". But why are there about twice as many float numbers as entries? What do those float numbers mean?
Many thanks,
Shuai Hua
After reading parts of the CRF++ 0.58 source code, I now understand the crf_learn output. I will use some examples to explain it.
==== Basic ====
Let's assume we have the following training data:
毎 k B
日 k I
新 k I
聞 k I
社 k I
特 k B
別 k I
顧 k B
問 k I
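Our template is very simple and has only one line: U00:%x[0,0]
With this template each distinct word gives one feature, so there are 9 features. Now suppose the template has two lines: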
U00:%x[0,0]
U00:%x[-1,0]/%x[0,0]/%x[1,0]
Now the template has two rules, so the total number of features becomes 18:
毎, 日, 新, 聞, 社, 特, 別, 顧, 問
../毎/日 毎/日/新 日/新/聞 新/聞/社 聞/社/特 社/特/別 特/別/顧 別/顧/問 顧/問/..
(Each of the two rules is applied at every word position; ".." stands for a position outside the sentence.)
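The count of 18 can be checked with a small Python sketch. This is not CRF++ code: the expand helper and the ".." padding are assumptions that follow the notation used above, and CRF++ itself uses its own boundary markers.

```python
# The nine words of the example sentence, in order.
tokens = list("毎日新聞社特別顧問")

def expand(offsets, tokens, pos, pad=".."):
    """Build the feature string of one rule at one position.

    offsets lists the row offsets of the rule: [0] for %x[0,0] and
    [-1, 0, 1] for %x[-1,0]/%x[0,0]/%x[1,0]. Positions outside the
    sentence are written as ".." (hypothetical placeholder).
    """
    parts = []
    for off in offsets:
        i = pos + off
        parts.append(tokens[i] if 0 <= i < len(tokens) else pad)
    return "/".join(parts)

rules = [[0], [-1, 0, 1]]

# A set keeps each distinct feature string only once.
features = set()
for rule in rules:
    for pos in range(len(tokens)):
        features.add(expand(rule, tokens, pos))

print(len(features))
for f in sorted(features):
    print(f)
```

Using a set mirrors the idea that identical feature strings are counted only once.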
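Now suppose a word is repeated in the training data:
毎 k B
毎 k B
日 k I
新 k I
聞 k I
社 k I
特 k B
別 k I
顧 k B
問 k I
The word "毎" appears twice, but U00:毎 is regarded as only one feature, so the single-word rule still gives 9 features. (The context rule does see new neighbours here: ../毎/毎 and 毎/毎/日 replace ../毎/日, so it gives 10 features and the total is 19 rather than 18.)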
==== Advanced ====
Now let's see how to understand the content in "model.txt".
1) A blank line is used to delimit the blocks:
1. First block:
version: 100 cost-factor: 1 maxid: 670 xsize: 1
The maxid depends on the number of features and the number of tags.
Using the first training data as an example (9 different words and two tags, B and I):
the ids start from 0 and go up in steps of 2: 0, 0+2=2, 2+2=4, ..., 16. The last feature occupies ids 16 and 17, so maxid is 18, the total number of weights. (In the model from the question: 18 U00 features x 2 tags + 4 weights for the B rule = 40.)
Why is the step 2?
Because we have two tags. Each word corresponds to two weights, one per tag:
0 毎 ==> B
1 毎 ==> I
2 日 ==> B
3 日 ==> I
...
16 問 ==> B
17 問 ==> I
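The id arithmetic can be sketched in Python as follows. Two things here are assumptions for illustration: ids are handed out in first-seen order (the ids in a real model.txt need not follow the order of the training data), and num_weights stands for the total number of weight slots, which is what maxid equals in the model from the question (40 slots, 40 float numbers).

```python
tags = ["B", "I"]
words = list("毎日新聞社特別顧問")

# Each new feature reserves len(tags) consecutive weight slots,
# so feature ids advance in steps of len(tags).
feature_id = {}
next_id = 0
for w in words:
    if w not in feature_id:
        feature_id[w] = next_id
        next_id += len(tags)

num_weights = next_id  # total number of weight slots

for w, fid in feature_id.items():
    for k, tag in enumerate(tags):
        print(fid + k, w, "==>", tag)
print("number of weights:", num_weights)
```

With three tags the step would be 3, and the same nine words would need 27 slots.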
2. Second block:
lists all the tags in the training data:
B I
3. Third block:
lists all the templates used:
U00:%x[0,0]
B
4. Fourth block:
the feature id, followed by the template name and the corresponding word:
0 U00:毎 2 U00:日 ...
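5. Fifth block:
For each feature, the weight for each tag:
there are two weights for each word, one per tag.
These are weights rather than probabilities, so they can be negative. A negative weight is not ignored; it lowers the score of that tag for that feature.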
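To make the link between the fourth and fifth blocks concrete, here is a Python sketch that labels every float number in the model from the question. The label_weights helper is hypothetical, not part of CRF++. The order of the four slots of the bigram rule B (previous tag first, then current tag) is my reading of the source code and should be treated as an assumption.

```python
TAGS = ["B", "I"]

# Fourth block of the model in the question: "id name" pairs.
FEATURES = (
    "36 B 20 U00:、 26 U00:か 18 U00:が 22 U00:こ 8 U00:た 10 U00:ち "
    "2 U00:っ 4 U00:て 34 U00:に 12 U00:の 0 U00:よ 28 U00:ら 24 U00:れ "
    "32 U00:上 14 U00:世 16 U00:代 30 U00:地 6 U00:私"
)

# Fifth block of the same model: one float per weight slot.
WEIGHTS = """
-0.3022268562246992 0.3022268562246989 -0.3629407244093161 0.3629407244093156
-0.3327259487028221 0.3327259487028215 0.3462799099537973 -0.3462799099537980
0.3452020097664334 -0.3452020097664336 -0.3218750203631590 0.3218750203631575
0.0376944272290242 -0.0376944272290280 0.3329631783491211 -0.3329631783491230
-0.3092967308014029 0.3092967308014015 0.3413769126433928 -0.3413769126433950
0.3786782765859961 -0.3786782765859980 0.5208645073272351 -0.5208645073272384
-0.3261580548802839 0.3261580548802814 -0.3615756495615902 0.3615756495615884
-0.3248593224319323 0.3248593224319312 0.3281895709166696 -0.3281895709166719
-0.3040331359589971 0.3040331359589951 0.2836939567332580 -0.2836939567332600
-0.1530917919770705 -0.1613508585854637 0.4245699543724943 -0.1101273038099901
"""

def label_weights(tags, features, weights):
    """Map (feature name, tag or tag transition) to its weight."""
    fields = features.split()
    table = {}
    for fid, name in zip(fields[0::2], fields[1::2]):
        fid = int(fid)
        if name.startswith("B"):
            # Bigram rule: one slot per (previous tag, current tag) pair.
            # Assumed slot order: previous tag varies slowest.
            labels = [p + "->" + c for p in tags for c in tags]
        else:
            # Unigram rule: one slot per tag.
            labels = list(tags)
        for k, label in enumerate(labels):
            table[(name, label)] = weights[fid + k]
    return table

weights = [float(w) for w in WEIGHTS.split()]
table = label_weights(TAGS, FEATURES, weights)

print(table[("U00:よ", "B")])  # first float number in the block
print(table[("U00:私", "B")], table[("U00:私", "I")])
print(table[("B", "I->B")])
```

Read this way, the first float number belongs to U00:よ with tag B (id 0), not to "36 B", and the last four float numbers belong to the B rule.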
- Shuai Hua