What preprocessing steps should be taken before passing text to the Stanford NER tagger?

Initially I followed preprocessing steps such as stop-word removal, HTML stripping, and punctuation removal. However, when I skip these steps, the NER seems to perform better. Can anyone tell me which preprocessing steps should be applied?

Solution

The only thing StanfordNER needs is clean text. By clean I mean no HTML or any other kind of document meta-tags. You also shouldn't remove stop-words, since they can help the model decide which label to give a word.
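
If your input is HTML, the cleaning step could look roughly like the sketch below. It uses only Python's standard `html.parser` module; `strip_html` and `_TextExtractor` are hypothetical names made up for this illustration, not part of StanfordNER. It removes tags (and the contents of `<script>` and `<style>`), but deliberately leaves stop-words and punctuation alone.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def strip_html(raw):
    """Hypothetical helper: return only the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_html("<p>Soros accuses <b>Trump</b> of wanting a 'mafia state'.</p>"))
```

The result can be written to a text file and passed to the tagger unchanged.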

Just have a file with clean text:

echo "Switzerland, Davos 2018: Soros accuses Trump of wanting a 'mafia state' and blasts social media." > test_file.txt

Then you call stanford-ner.jar and pass it a trained model, e.g. classifiers/english.all.3class.distsim.crf.ser.gz, and an input file, e.g. test_file.txt.

Like this:

java -cp stanford-ner-2017-06-09/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile test_file.txt -outputFormat tsv

This should output something like this:

Switzerland LOCATION
,   O
Davos   PERSON
2018    O
:   O
Soros   PERSON
accuses O
Trump   PERSON
of  O
wanting O
a   O
`   O
mafia   O
state   O
'   O
and O
blasts  O
social  O
media   O
.   O
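
If you want entities rather than per-token labels, output in the shape of the listing above (one token and its label per line) can be post-processed with a few lines of Python. This is only a sketch: `group_entities` is a hypothetical helper, and it assumes the label is the last whitespace-separated field on each line. Because the labels carry no begin/inside markers, two adjacent entities of the same type would be merged into one.

```python
def group_entities(tagged_output, outside_label="O"):
    """Hypothetical helper: merge consecutive tokens that share a label.

    Returns a list of (entity_text, label) tuples, ignoring tokens whose
    label is `outside_label`.
    """
    entities = []
    tokens, label = [], outside_label
    # The extra empty line guarantees that a trailing entity is flushed.
    for line in tagged_output.splitlines() + [""]:
        parts = line.rsplit(None, 1)
        # Lines without exactly two fields (e.g. blank lines) act as boundaries.
        tok, lab = parts if len(parts) == 2 else (None, outside_label)
        if lab != label or tok is None:
            if tokens and label != outside_label:
                entities.append((" ".join(tokens), label))
            tokens, label = [], lab
        if tok is not None and lab != outside_label:
            tokens.append(tok)
    return entities


if __name__ == "__main__":
    sample = "Switzerland\tLOCATION\n,\tO\nDavos\tPERSON\n2018\tO\nSoros\tPERSON\n"
    for text, label in group_entities(sample):
        print(label, text)
```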

As you can see, you don't even need to handle tokenisation (i.e., splitting the sentence into individual tokens); StanfordNER does that for you.

Another useful feature is to set up StanfordNER as a webservice:

java -mx2g -cp stanford-ner-2017-06-09/stanford-ner.jar edu.stanford.nlp.ie.NERServer -loadClassifier my_model.ser.gz -port 9191 -outputFormat inlineXML

Then you can simply telnet to it, or send a sentence over a plain TCP socket from your own code, and get it back tagged:

telnet 127.0.0.1 9191
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Switzerland, Davos 2018: Soros accuses Trump of wanting a 'mafia state' and blasts social media.

<LOCATION>Switzerland</LOCATION>, <PERSON>Davos</PERSON> 2018: <PERSON>Soros</PERSON> accuses <PERSON>Trump</PERSON> of wanting a 'mafia state' and blasts social media.

Connection closed by foreign host.
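
The same exchange as in the telnet session above can be done from code. The following is a rough sketch, not an official client: `tag_sentence` and `parse_inline_xml` are hypothetical helpers, and it assumes the server listens on 127.0.0.1:9191 as started above, speaks UTF-8, and closes the connection after replying (as the session suggests).

```python
import re
import socket

# Matches <LABEL>text</LABEL> pairs as they appear in the inlineXML reply.
_ENTITY = re.compile(r"<([A-Z_]+)>(.*?)</\1>")


def tag_sentence(sentence, host="127.0.0.1", port=9191, timeout=10.0):
    """Send one line to the server and return its reply as a string.

    Assumption: the server answers and then closes the connection, so we
    read until recv() returns an empty bytes object.
    """
    # Send everything as a single line, terminated by a newline.
    payload = sentence.replace("\n", " ").encode("utf-8") + b"\n"
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8").strip()


def parse_inline_xml(tagged):
    """Return (entity_text, label) tuples found in an inlineXML string."""
    return [(m.group(2), m.group(1)) for m in _ENTITY.finditer(tagged)]


if __name__ == "__main__":
    reply = tag_sentence("Switzerland, Davos 2018: Soros accuses Trump.")
    print(parse_inline_xml(reply))
```

Running the script requires the server from the previous step to be up; `parse_inline_xml` on its own works on any string in the inlineXML shape shown above.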