Tags: python, nlp, nltk, stanford-nlp, named-entity-recognition

How to clean sentences for StanfordNER


I want to use StanfordNER in Python to detect named entities. How should I clean up the sentences?

For example, consider

qry="In the UK, the class is relatively crowded with Zacc competing with Abc's Popol (market leader) and Xyz's Abcvd."

If I do

from nltk.tag.stanford import StanfordNERTagger

st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
print st.tag(qry.split())

I get

[
    (u'In', u'O'), (u'the', u'O'), (u'UK,', u'O'), (u'the', u'O'), 
    (u'class', u'O'), (u'is', u'O'), (u'relatively', u'O'), (u'crowded', u'O'), 
    (u'with', u'O'), (u'Zacc', u'PERSON'), (u'competing', u'O'), (u'with', u'O'), 
    (u"Abc's", u'O'), (u'Popol', u'O'), (u'(market', u'O'), (u'leader)', u'O'), 
    (u'and', u'O'), (u"Xyz's", u'O'), (u'Abcvd.', u'O')
]


So only one named entity was detected. However, suppose I do some cleanup by replacing all special characters with spaces, so that the query becomes

qry="In the UK the class is relatively crowded with Zacc competing with Abc s Popol market leader and Xyz s Abcvd"

Tagging this cleaned-up sentence in the same way, I get

[
    (u'In', u'O'), (u'the', u'O'), (u'UK', u'LOCATION'), (u'the', u'O'), 
    (u'class', u'O'), (u'is', u'O'), (u'relatively', u'O'), (u'crowded', u'O'), 
    (u'with', u'O'), (u'Zacc', u'PERSON'), (u'competing', u'O'), (u'with', u'O'), 
    (u'Abc', u'ORGANIZATION'), (u's', u'O'), (u'Popol', u'PERSON'), (u'market', u'O'), 
    (u'leader', u'O'), (u'and', u'O'), (u'Xyz', u'ORGANIZATION'), (u's', u'O'), (u'Abcvd', u'PERSON')
]


So clearly, this is more appropriate. Are there any general rules on how to clean up sentences for StanfordNER? Initially I thought that no cleanup was required at all!


Solution

  • You can use the Stanford Tokenizer for this purpose. The code below shows how:

    from nltk.tokenize.stanford import StanfordTokenizer

    # point the tokenizer at the jar shipped with the Stanford NER distribution
    tokenizer = StanfordTokenizer('stanford-ner-2014-06-16/stanford-ner.jar')
    qry = "In the UK, the class is relatively crowded with Zacc competing with Abc's Popol (market leader) and Xyz's Abcvd."
    tok = tokenizer.tokenize(qry)
    print tok
    

    You will get the tokens just as you require them:

    [u'In',
    u'the',
    u'UK',
    u',',
    u'the',
    u'class',
    u'is',
    u'relatively',
    u'crowded',
    u'with',
    u'Zacc',
    u'competing',
    u'with',
    u'Abc',
    u"'s",
    u'Popol',
    u'-LRB-',
    u'market',
    u'leader',
    u'-RRB-',
    u'and',
    u'Xyz',
    u"'s",
    u'Abcvd',
    u'.']
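
    You can then feed this token list to the NER tagger instead of qry.split(). A quick sketch, reusing the same model file as in your question (the jar path here is only an example, and the -LRB-/-RRB- tokens are simply the Penn Treebank escapes for the parentheses):

    from nltk.tag.stanford import StanfordNERTagger

    # same model as in the question; pass the jar explicitly if CLASSPATH is not already set
    st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',
                           'stanford-ner-2014-06-16/stanford-ner.jar')
    print st.tag(tok)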