python nlp stanford-nlp

What preprocessing steps should be taken before passing text to the Stanford NER tagger?


Initially I applied preprocessing steps such as stop-word removal, HTML stripping, and punctuation removal. However, when I skip these steps, the NER tagger seems to perform better. Can anyone tell me which preprocessing steps should be applied?


Solution

  • The only thing StanfordNER needs is clean text. By clean I mean no HTML or any other kind of document meta-tags. You also shouldn't remove stop words or punctuation: the model can use them as context when deciding which label to give a word.
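As a rough illustration of the "clean text" step, here is a minimal Python sketch that strips HTML tags using only the standard library and leaves stop words and punctuation untouched. The names `strip_html` and `_TextExtractor` are made up for this example, and skipping `<script>`/`<style>` contents is my own assumption about what counts as non-text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse whitespace runs left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def strip_html(markup):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_html("<p>Soros accuses <b>Trump</b> of wanting a 'mafia state'.</p>"))
```

Note that nothing else is removed: casing, stop words, and punctuation all reach the tagger as written.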

    Just have a file with clean text:

    echo "Switzerland, Davos 2018: Soros accuses Trump of wanting a 'mafia state' and blasts social media." > test_file.txt
    

    Then call stanford-ner.jar and pass it a trained model, e.g. classifiers/english.all.3class.distsim.crf.ser.gz, and an input file, e.g. test_file.txt.

    Like this:

    java -cp stanford-ner-2017-06-09/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile test_file.txt -outputFormat tsv
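If you want to run the same command from Python (the question is tagged python), a thin `subprocess` wrapper is one option. This is only a sketch: `build_ner_command` and `run_ner` are hypothetical helper names, the paths are the ones used in this answer, and it assumes `java` is on your PATH.

```python
import subprocess


def build_ner_command(jar_path, model_path, text_file, output_format):
    """Build the argument list for the CRFClassifier command line shown above."""
    return [
        "java", "-cp", jar_path,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", model_path,
        "-textFile", text_file,
        "-outputFormat", output_format,
    ]


def run_ner(jar_path, model_path, text_file, output_format):
    """Run the tagger and return whatever it prints to stdout."""
    result = subprocess.run(
        build_ner_command(jar_path, model_path, text_file, output_format),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if java exits with an error
    )
    return result.stdout


# Example call (requires Java and the Stanford NER files to be present):
# tagged = run_ner(
#     "stanford-ner-2017-06-09/stanford-ner.jar",
#     "classifiers/english.all.3class.distsim.crf.ser.gz",
#     "test_file.txt",
#     "inlineXML",
# )
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.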
    

    This should output something like this:

    Switzerland LOCATION
    ,   O
    Davos   PERSON
    2018    O
    :   O
    Soros   PERSON
    accuses O
    Trump   PERSON
    of  O
    wanting O
    a   O
    `   O
    mafia   O
    state   O
    '   O
    and O
    blasts  O
    social  O
    media   O
    .   O
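To use this token-per-line output downstream, you would typically collect the tokens that are not labelled `O`. The sketch below parses output in the shape shown above and merges consecutive tokens that share a label; the function names are illustrative, not part of StanfordNER.

```python
def parse_tagged_lines(output):
    """Turn 'token<whitespace>label' lines into (token, label) pairs."""
    pairs = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pairs.append((parts[0], parts[1]))
    return pairs


def group_entities(pairs, outside="O"):
    """Merge runs of consecutive tokens that share a non-O label.

    Limitation: two distinct entities of the same type that are directly
    adjacent are merged into one, because this format has no B-/I- prefixes.
    """
    entities = []
    current_tokens, current_label = [], None
    for token, label in pairs:
        if label == current_label and label != outside:
            current_tokens.append(token)
            continue
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))
        if label != outside:
            current_tokens, current_label = [token], label
        else:
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


if __name__ == "__main__":
    sample = "Switzerland\tLOCATION\n,\tO\nDavos\tPERSON\n2018\tO\n:\tO\nSoros\tPERSON\n"
    print(group_entities(parse_tagged_lines(sample)))
```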
    

    As you can see, you don't even need to handle tokenisation (i.e., splitting the sentence into individual tokens/words); StanfordNER does that for you.

    Another useful feature is to set up StanfordNER as a webservice:

    java -mx2g -cp stanford-ner-2017-06-09/stanford-ner.jar edu.stanford.nlp.ie.NERServer -loadClassifier my_model.ser.gz -port 9191 -outputFormat inlineXML
    

    Then you can simply telnet to the port, or send a sentence over a socket from your own code, and get it back tagged:

    telnet 127.0.0.1 9191
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    Switzerland, Davos 2018: Soros accuses Trump of wanting a 'mafia state' and blasts social media.
    
    <LOCATION>Switzerland</LOCATION>, <PERSON>Davos</PERSON> 2018: <PERSON>Soros</PERSON> accuses <PERSON>Trump</PERSON> of wanting a 'mafia state' and blasts social media.
    
    Connection closed by foreign host.
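The telnet session above can be reproduced from Python with a plain TCP socket. The following is a sketch under assumptions inferred from that session: the server reads one newline-terminated line, writes the tagged text, and closes the connection. `tag_text` and `extract_entities` are hypothetical helper names; the host and port are the ones used in this answer.

```python
import re
import socket


def tag_text(text, host="127.0.0.1", port=9191, timeout=10.0):
    """Send one line of text to a running NERServer and return its reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # The server is assumed to read a single line, so flatten newlines.
        conn.sendall(text.replace("\n", " ").encode("utf-8") + b"\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8").strip()


# Matches inlineXML spans such as <PERSON>Soros</PERSON>.
ENTITY_PATTERN = re.compile(r"<([A-Z_]+)>(.*?)</\1>")


def extract_entities(inline_xml):
    """Return (text, label) pairs found in an inlineXML-tagged string."""
    return [(m.group(2), m.group(1)) for m in ENTITY_PATTERN.finditer(inline_xml)]


# Example (requires the server from the command above to be running):
# tagged = tag_text("Soros accuses Trump of wanting a 'mafia state'.")
# print(extract_entities(tagged))
```

Keeping the server running avoids reloading the model for every sentence, which is usually the slowest part of a command-line call.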