Named Entity Recognition using WEKA

I am new to WEKA and have a few questions about it. I followed this tutorial (Named Entity Recognition using WEKA).

But I am really confused and have no idea how to proceed.

  1. Is it possible to filter the string by phrase rather than by word/token?

For example in my .ARFF file:

  @attribute text string
  @attribute tag {CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, LS, MD, NN, NNS, NNP, NNPS, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB, ',', '.', ':'}
  @attribute capital {Y, N}
  @attribute chunked {B-NP, I-NP, B-VP, I-VP, B-PP, I-PP, B-ADJP, B-ADVP, B-SBAR, B-PRT, O-Punctuation}
  @attribute @@class@@ {B-PER, I-PER, B-ORG, I-ORG, B-NUM, I-NUM, O, B-LOC, I-LOC}


When I applied the string filter, it tokenized the string into words, but I want to tokenize/filter the string by phrase. For example, I want to extract the phrase "New York", not "New" and "York" separately, according to the chunked attribute.

"B-NP" marks the start of a phrase and "I-NP" marks its continuation (the middle or end of the phrase).

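To make the intended grouping concrete, here is a minimal sketch in plain Python (outside WEKA) of how tokens could be merged into phrases from the B-/I- chunk tags. The function name `group_chunks` and the sample sentence are hypothetical, and treating a stray "I-" tag as the start of a new phrase is an assumption, not something the tutorial prescribes.

```python
def group_chunks(tokens, chunk_tags):
    """Merge tokens into phrases using B-/I- chunk tags.

    A "B-" tag opens a new phrase, an "I-" tag extends the open phrase
    when the chunk type matches, and any other tag (for example
    "O-Punctuation") becomes a one-token phrase of its own.
    """
    phrases = []  # list of (phrase_text, chunk_type) pairs
    current_words, current_type = [], None

    def flush():
        # Close the phrase being built, if any, and reset the state.
        nonlocal current_words, current_type
        if current_words:
            phrases.append((" ".join(current_words), current_type))
        current_words, current_type = [], None

    for token, tag in zip(tokens, chunk_tags):
        prefix, _, chunk_type = tag.partition("-")
        if prefix == "I" and current_type == chunk_type:
            current_words.append(token)
        elif prefix in ("B", "I"):
            # Assumption: an "I-" tag with no matching open phrase
            # starts a new phrase instead of raising an error.
            flush()
            current_words, current_type = [token], chunk_type
        else:
            flush()
            phrases.append((token, tag))
    flush()
    return phrases


# Hypothetical sample sentence, one chunk tag per token.
tokens = ["He", "visited", "New", "York", "."]
chunks = ["B-NP", "B-VP", "B-NP", "I-NP", "O-Punctuation"]
print(group_chunks(tokens, chunks))
```

With this input, "New" and "York" end up in a single NP phrase, which could then be written out as one instance per phrase instead of one per token.
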
  2. How can I show the classification results under a merged class name, for example B-PER and I-PER together as the class PERSON?

                 TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0         0.021     0           0        0           0.768      B-PER
                 1         0.084     0.333       1        0.5         0.963      I-PER
                 0.167     0.054     0.167       0.167    0.167       0.313      B-ORG
                 0         0         0           0        0           0.964      I-ORG
                 0         0         0           0        0           0.281      B-NUM
                 0         0         0           0        0           0.148      I-NUM
                 0.972     0.074     0.972       0.972    0.972       0.949      O
                 0.875     0         1           0.875    0.933       0.977      B-LOC
                 0         0         0           0        0           0.907      I-LOC
  Weighted Avg.  0.828     0.061     0.811       0.828    0.813       0.894
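
The per-class rates in a table like this cannot simply be averaged into a PERSON row, because the underlying counts are needed. One possible approach, sketched here in plain Python under the assumption that the actual and predicted labels are available as two parallel lists, is to map the B-/I- labels to a merged name first and then recompute the token-level scores. Only PERSON comes from the question; the other merged names and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical mapping: only PERSON is named in the question, the other
# merged class names are assumptions for illustration.
MERGED = {"PER": "PERSON", "ORG": "ORGANIZATION", "LOC": "LOCATION", "NUM": "NUMBER"}


def merge_label(label):
    """Map 'B-PER' and 'I-PER' to 'PERSON'; leave 'O' unchanged."""
    if "-" not in label:
        return label
    _, entity = label.split("-", 1)
    return MERGED.get(entity, entity)


def per_class_scores(actual, predicted):
    """Token-level (precision, recall, F-measure) per merged class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        a, p = merge_label(a), merge_label(p)
        if a == p:
            tp[a] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[a] += 1  # actual class a was missed
    scores = {}
    for cls in sorted(set(tp) | set(fp) | set(fn)):
        predicted_total = tp[cls] + fp[cls]
        actual_total = tp[cls] + fn[cls]
        precision = tp[cls] / predicted_total if predicted_total else 0.0
        recall = tp[cls] / actual_total if actual_total else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        scores[cls] = (precision, recall, f_measure)
    return scores


# Hypothetical labels, not taken from the output above.
actual = ["B-PER", "I-PER", "O", "B-LOC"]
predicted = ["I-PER", "I-PER", "O", "O"]
for cls, (p, r, f) in per_class_scores(actual, predicted).items():
    print(f"{cls:<12} precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
```

Note that after merging, predicting I-PER for a B-PER token counts as correct, so the merged scores are usually more forgiving than the separate B-/I- rows.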


  • In my opinion, WEKA is not (currently) the best machine learning software for NER. As far as I know, WEKA classifies sets of independent examples, so NER could be set up in one of two ways:

    1. By splitting sentences into tokens: in that case the sequence (i.e. contiguity) is lost. "New" and "York" are two separate examples, and the fact that these words are adjacent is not taken into account in any way.
    2. By keeping chunks/sentences as examples: sequences can then be kept whole and filtered (with StringToWordVector, for instance), but a single class has to be assigned to each chunk/sentence (for instance, O+O+O+B-LOC+I-LOC+O would be the class of the whole sentence in your example).

    In both cases, contiguity is not taken into account, which is a real problem. As far as I know, the same holds for R (?). This is why "sequence labelling" tasks (NER, morpho-syntax, syntax and dependencies) are usually done with software that determines a token's category from the current word as well as the previous word, the next word, etc., and that can output single tokens as well as multi-token expressions or more complicated structures.
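
    To illustrate what "using the previous and next word" means in practice, here is a minimal sketch in plain Python that attaches neighbouring words to each token as extra features. The function name `add_context`, the feature names and the `<PAD>` placeholder are all hypothetical. It only approximates what sequence labelling tools do, since they also model the dependency between neighbouring output labels.

```python
def add_context(tokens, window=1, pad="<PAD>"):
    """Return one feature dict per token, including neighbouring words.

    For window=1 each dict holds the current word plus the previous and
    next word. Positions outside the sentence are filled with `pad`.
    """
    rows = []
    for i, token in enumerate(tokens):
        row = {"word": token}
        for offset in range(1, window + 1):
            prev_i, next_i = i - offset, i + offset
            row[f"prev{offset}"] = tokens[prev_i] if prev_i >= 0 else pad
            row[f"next{offset}"] = tokens[next_i] if next_i < len(tokens) else pad
        rows.append(row)
    return rows


# Hypothetical sample: "York" now carries "New" as its previous word.
for row in add_context(["in", "New", "York"]):
    print(row)
```

    Each dict could be written out as extra nominal or string attributes of a token-level instance, so that "New" and "York" are no longer completely independent examples.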

    Currently, CRFs (conditional random fields) are usually used for NER, see:

    • CRF++
    • CRFSuite
    • Wapiti
    • Mallet
    • ...