Stanford NER tool -- spaces in training file

I've been looking through the Stanford NER classifier. I have been able to train a model using a simple file in which only spaces delimit the items the system expects. For instance,

/a/b/c sanferro 2

/d/e/f ginger 2

However, I run into errors while trying forms such as:

/a/b/c san ferro 2

Here "san ferro" is a single "word" and "2" is the "answer" or desired labeling output. How can I encode spaces? I've tried enclosing the word in double quotes, but that doesn't work.

Solution

Typically you use CoNLL-style data to train a CRF. Here is an example:

-DOCSTART-    O 

John    PERSON
Smith   PERSON
went    O
to      O
France  LOCATION
.       O

Jane    PERSON
Smith   PERSON
went    O
to      O
Hawaii  LOCATION
.       O

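A tab character ("\t") separates each token from its tag, with one token per line. You put a blank line between sentences, and you use the special symbol "-DOCSTART-" to indicate where a new document starts. A multi-word name is split into its tokens, each on its own line with the same tag, as with "John" and "Smith" above, so "san ferro" would become two lines rather than one entry containing a space. When training a CRF you typically provide a large set of such sentences.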
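
If your data starts out as phrases with labels, the one-token-per-line layout can be produced with a short script. This is only a sketch in Python: the function name to_conll_lines and the input shape (a list of sentences, each a list of (phrase, label) pairs) are made up for illustration and are not part of Stanford NER.

```python
def to_conll_lines(sentences):
    """Turn sentences of (phrase, label) pairs into tab-separated lines.

    Each phrase is split on whitespace, so a multi-word phrase becomes
    several lines that all carry the same label.
    """
    lines = []
    for sentence in sentences:
        for phrase, label in sentence:
            for token in phrase.split():
                lines.append(f"{token}\t{label}")
        lines.append("")  # blank line marks the end of the sentence
    return lines


# Hypothetical input, mirroring the example above.
example = [[("John Smith", "PERSON"), ("went", "O"), ("to", "O"),
            ("France", "LOCATION"), (".", "O")]]
print("\n".join(to_conll_lines(example)))
```

The output can be written to a file and used as training data, with the word in the first column and the answer in the second.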

If you just want to tag certain patterns the same way all the time, you may want to use RegexNER, which is described here: http://nlp.stanford.edu/software/regexner/
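
As a rough illustration, a RegexNER mapping file is tab-separated, with a space-separated sequence of token patterns in the first column and the label in the second (see the page linked above for the full format). The Python sketch below builds such lines; the helper name regexner_line and the sample phrases are hypothetical. Each token is treated as a regular expression by RegexNER, and this sketch does no escaping, so it assumes the phrases contain no regex metacharacters.

```python
def regexner_line(phrase, label):
    """Build one mapping line: space-separated tokens, a tab, then the label."""
    tokens = phrase.split()  # normalizes any run of whitespace to single spaces
    return " ".join(tokens) + "\t" + label


# Hypothetical entries taken from the question's example.
mapping = [regexner_line("san ferro", "2"), regexner_line("ginger", "2")]
print("\n".join(mapping))
```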

Here is more documentation on using the NER system: http://nlp.stanford.edu/software/crf-faq.shtml