stanford-nlp

Stanford NER tool -- spaces in training file


I've been looking through the Stanford NER classifier. I have been able to train a model using a simple file that uses only spaces to delimit the items the system expects. For instance,

/a/b/c sanferro 2

/d/e/f ginger 2

However, I run into errors while trying forms such as:

/a/b/c san ferro 2

Here "san ferro" is a single "word" and "2" is the "answer", i.e. the desired labeling output. How can I encode spaces? I've tried enclosing the word in double quotes, but that doesn't work.


Solution

  • Typically you use CoNLL-style data to train a CRF. Here is an example:

    -DOCSTART-    O 
    
    John    PERSON
    Smith   PERSON
    went    O
    to      O
    France  LOCATION
    .       O
    
    Jane    PERSON
    Smith   PERSON
    went    O
    to      O
    Hawaii  LOCATION
    .       O
    

    A tab character ("\t") separates each token from its tag, and each token goes on its own line, so a multi-word name such as "John Smith" is written as two token lines that carry the same tag. Put a blank line between sentences, and use the special symbol "-DOCSTART-" to mark where a new document starts. When training a CRF, you typically provide a large set of such sentences.

    If you just want to tag certain patterns the same way all the time, you may want to use RegexNER, which is described here: http://nlp.stanford.edu/software/regexner/

    Here is more documentation on using the NER system: http://nlp.stanford.edu/software/crf-faq.shtml
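
As a rough illustration of the tab-separated layout described above, here is a minimal Python sketch that turns labeled phrases into one-token-per-line training text. The function name `to_conll_lines` and the input shape (a list of sentences, each a list of `(phrase, label)` pairs) are hypothetical and not part of Stanford NER. Splitting "san ferro" on whitespace and giving both tokens the same label is an assumption about the labeling you want, and the leading `/a/b/c` column from the question is not handled here.

```python
def to_conll_lines(sentences):
    """Convert labeled phrases to tab-separated token/tag lines.

    `sentences` is a list of sentences; each sentence is a list of
    (phrase, label) pairs. This input shape is an assumption made
    for illustration only.
    """
    lines = []
    for sentence in sentences:
        for phrase, label in sentence:
            # A multi-word phrase becomes several token lines,
            # each carrying the same label (assumed labeling scheme).
            for token in phrase.split():
                lines.append(f"{token}\t{label}")
        # A blank line marks the end of a sentence.
        lines.append("")
    return lines


example = [[("san ferro", "2"), ("ginger", "2")]]
print("\n".join(to_conll_lines(example)))
```

Writing the result to a file gives you something shaped like the CoNLL example above, with a tab between token and tag and a blank line after each sentence.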