Tags: python, nlp, nltk, stanford-nlp, named-entity-recognition

How to clean sentences for StanfordNER


I want to use StanfordNER in Python to detect named entities. How should I clean up the sentences?

For example, consider

qry="In the UK, the class is relatively crowded with Zacc competing with Abc's Popol (market leader) and Xyz's Abcvd."

If I do

from nltk.tag.stanford import StanfordNERTagger

st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
print st.tag(qry.split())

I get

[
    (u'In', u'O'), (u'the', u'O'), (u'UK,', u'O'), (u'the', u'O'), 
    (u'class', u'O'), (u'is', u'O'), (u'relatively', u'O'), (u'crowded', u'O'), 
    (u'with', u'O'), (u'Zacc', u'PERSON'), (u'competing', u'O'), (u'with', u'O'), 
    (u"Abc's", u'O'), (u'Popol', u'O'), (u'(market', u'O'), (u'leader)', u'O'), 
    (u'and', u'O'), (u"Xyz's", u'O'), (u'Abcvd.', u'O')
]


So only one named entity was detected. However, suppose I do some cleanup by replacing all special characters with spaces, so that the query becomes

qry="In the UK the class is relatively crowded with Zacc competing with Abc s Popol market leader and Xyz s Abcvd"

Tagging this cleaned-up sentence in the same way, I get

[
    (u'In', u'O'), (u'the', u'O'), (u'UK', u'LOCATION'), (u'the', u'O'), 
    (u'class', u'O'), (u'is', u'O'), (u'relatively', u'O'), (u'crowded', u'O'), 
    (u'with', u'O'), (u'Zacc', u'PERSON'), (u'competing', u'O'), (u'with', u'O'), 
    (u'Abc', u'ORGANIZATION'), (u's', u'O'), (u'Popol', u'PERSON'), (u'market', u'O'), 
    (u'leader', u'O'), (u'and', u'O'), (u'Xyz', u'ORGANIZATION'), (u's', u'O'), (u'Abcvd', u'PERSON')
]


So clearly, this is more appropriate. Are there any general rules on how to clean up sentences for StanfordNER? Initially I thought that no cleanup was required at all!


Solution

  • You can use the Stanford Tokenizer for this purpose. The code below shows how:

    from nltk.tokenize.stanford import StanfordTokenizer

    # point the tokenizer at the jar shipped with the Stanford NER distribution
    tokenizer = StanfordTokenizer('stanford-ner-2014-06-16/stanford-ner.jar')
    qry = "In the UK, the class is relatively crowded with Zacc competing with Abc's Popol (market leader) and Xyz's Abcvd."
    tok = tokenizer.tokenize(qry)
    print tok
    

    You will get the tokens just as you require them:

    [u'In',
    u'the',
    u'UK',
    u',',
    u'the',
    u'class',
    u'is',
    u'relatively',
    u'crowded',
    u'with',
    u'Zacc',
    u'competing',
    u'with',
    u'Abc',
    u"'s",
    u'Popol',
    u'-LRB-',
    u'market',
    u'leader',
    u'-RRB-',
    u'and',
    u'Xyz',
    u"'s",
    u'Abcvd',
    u'.']
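
    You can then feed this token list to the NER tagger instead of qry.split(). A quick sketch, reusing the same model file as in your question (the jar path here is only an example, and the -LRB-/-RRB- tokens are simply the Penn Treebank escapes for the parentheses):

    from nltk.tag.stanford import StanfordNERTagger

    # same model as in the question; pass the jar explicitly if CLASSPATH is not already set
    st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',
                           'stanford-ner-2014-06-16/stanford-ner.jar')
    print st.tag(tok)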