Initially I followed preprocessing steps such as stop-word removal, HTML stripping, and punctuation removal. However, when I skip them, the NER seems to perform better. Can anyone tell me which preprocessing steps should be followed?
The only thing StanfordNER needs is clean text. By clean I mean no HTML or any other kind of document meta-tags. You also shouldn't remove stop words, since they might help the model decide which label to give a certain word.
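If your input is HTML, a minimal sketch of that cleaning step could look like the following. It only strips markup and leaves stop words, punctuation and casing alone. It uses Python's standard `html.parser`; the names `clean_for_ner`, `_TextExtractor` and `BLOCK_TAGS` are made up for this example, and the list of block-level tags is an assumption you may need to extend.

```python
from html.parser import HTMLParser

# Hypothetical helper set: tags after which a space is inserted so that
# text from neighbouring blocks does not get glued together.
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "td", "th", "h1", "h2", "h3"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def clean_for_ner(raw_html):
    """Strip markup only; stop words and punctuation are kept on purpose."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()
```

The result of `clean_for_ner(...)` is what you would write to the input file in the next step.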
Just have a file with clean text:
echo "Switzerland, Davos 2018: Soros accuses Trump of wanting a 'mafia state' and blasts social media." > test_file.txt
Then call stanford-ner.jar and pass it a trained model, e.g. classifiers/english.all.3class.distsim.crf.ser.gz,
and an input file, e.g. test_file.txt.
Like this:
java -cp stanford-ner-2017-06-09/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile test_file.txt -outputFormat tsv
This should output something like this:
Switzerland LOCATION
, O
Davos PERSON
2018 O
: O
Soros PERSON
accuses O
Trump PERSON
of O
wanting O
a O
` O
mafia O
state O
' O
and O
blasts O
social O
media O
. O
As you can see, you don't even need to handle tokenisation (i.e., splitting the sentence into individual tokens/words); StanfordNER does that for you.
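If you want to consume the one-token-per-line output shown above from a script, here is a rough sketch in Python. It assumes each line holds a token and a label separated by whitespace (a tab in the tool's tsv format); `parse_tsv` and `group_entities` are names invented for this example, not part of StanfordNER.

```python
def parse_tsv(output):
    """Turn 'token<whitespace>label' lines into a list of (token, label) pairs."""
    pairs = []
    for line in output.splitlines():
        parts = line.strip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # skip blank lines (sentence breaks) and malformed lines
        pairs.append((parts[0], parts[1]))
    return pairs


def group_entities(pairs, outside="O"):
    """Merge runs of adjacent tokens that share the same non-O label."""
    entities = []
    current_tokens, current_label = [], None
    for token, label in pairs:
        if label != outside and label == current_label:
            current_tokens.append(token)  # continue the current entity
            continue
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))
        if label == outside:
            current_tokens, current_label = [], None
        else:
            current_tokens, current_label = [token], label
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities
```

Note that the labels carry no begin/inside markers, so two different entities of the same type that sit directly next to each other would be merged into one by this grouping.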
Another useful feature is to set up StanfordNER as a webservice:
java -mx2g -cp stanford-ner-2017-06-09/stanford-ner.jar edu.stanford.nlp.ie.NERServer -loadClassifier my_model.ser.gz -port 9191 -outputFormat inlineXML
Then you can simply telnet to it (or send a sentence over a plain TCP socket) and get it back tagged:
telnet 127.0.0.1 9191
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Switzerland, Davos 2018: Soros accuses Trump of wanting a 'mafia state' and blasts social media.
<LOCATION>Switzerland</LOCATION>, <PERSON>Davos</PERSON> 2018: <PERSON>Soros</PERSON> accuses <PERSON>Trump</PERSON> of wanting a 'mafia state' and blasts social media.
Connection closed by foreign host.
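The same exchange as the telnet session can be done from code. Below is a minimal sketch of a Python client, under a few assumptions: the server listens on 127.0.0.1:9191 as started above, it reads one line per connection and closes the connection after replying (as the telnet session suggests), and both sides use UTF-8. `tag_with_server` and `extract_entities` are hypothetical helper names.

```python
import re
import socket

# Matches inlineXML spans such as <PERSON>Soros</PERSON>.
ENTITY_RE = re.compile(r"<([A-Z_]+)>(.*?)</\1>")


def tag_with_server(sentence, host="127.0.0.1", port=9191, timeout=10.0):
    """Send one line to a running NER server and return the tagged reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # The server is assumed to read a single newline-terminated line.
        sock.sendall(sentence.strip().encode("utf-8") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8").strip()


def extract_entities(tagged):
    """Pull (text, label) pairs out of an inlineXML-tagged string."""
    return [(text, label) for label, text in ENTITY_RE.findall(tagged)]
```

With the server running, `extract_entities(tag_with_server("Soros accuses Trump."))` would give you the tagged spans as a list of tuples. The regex is only meant for the flat, non-nested tags that inlineXML output produces here; for anything more complex a real XML parser is the safer choice.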