I want to use StanfordNER in Python to detect named entities. How should I clean up the sentences?
For example, consider
qry="In the UK, the class is relatively crowded with Zacc competing with Abc's Popol (market leader) and Xyz's Abcvd."
If I do
from nltk.tag import StanfordNERTagger
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
print st.tag(qry.split())
I get
[
(u'In', u'O'), (u'the', u'O'), (u'UK,', u'O'), (u'the', u'O'),
(u'class', u'O'), (u'is', u'O'), (u'relatively', u'O'), (u'crowded', u'O'),
(u'with', u'O'), (u'Zacc', u'PERSON'), (u'competing', u'O'), (u'with', u'O'),
(u"Abc's", u'O'), (u'Popol', u'O'), (u'(market', u'O'), (u'leader)', u'O'),
(u'and', u'O'), (u"Xyz's", u'O'), (u'Abcvd.', u'O')
]
So only one named entity was detected. However, if I do some cleanup by replacing all special characters with spaces,
qry="In the UK the class is relatively crowded with Zacc competing with Abc s Popol market leader and Xyz s Abcvd"
I get
[
(u'In', u'O'), (u'the', u'O'), (u'UK', u'LOCATION'), (u'the', u'O'),
(u'class', u'O'), (u'is', u'O'), (u'relatively', u'O'), (u'crowded', u'O'),
(u'with', u'O'), (u'Zacc', u'PERSON'), (u'competing', u'O'), (u'with', u'O'),
(u'Abc', u'ORGANIZATION'), (u's', u'O'), (u'Popol', u'PERSON'), (u'market', u'O'),
(u'leader', u'O'), (u'and', u'O'), (u'Xyz', u'ORGANIZATION'), (u's', u'O'), (u'Abcvd', u'PERSON')]
So clearly, this is more appropriate. Are there any general rules on how to clean up sentences for StanfordNER? Initially I thought that no cleanup was required at all!
Rather than stripping special characters, you can use the Stanford Tokenizer to split the sentence into tokens. You could use the code below.
from nltk.tokenize.stanford import StanfordTokenizer
token = StanfordTokenizer('stanford-ner-2014-06-16/stanford-ner.jar')
qry="In the UK, the class is relatively crowded with Zacc competing with Abc's Popol (market leader) and Xyz's Abcvd."
tok = token.tokenize(qry)
print tok
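You will get the tokens as you require them, with punctuation and possessives split off into separate tokens. You can pass tok to st.tag() instead of qry.split().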
[u'In',
u'the',
u'UK',
u',',
u'the',
u'class',
u'is',
u'relatively',
u'crowded',
u'with',
u'Zacc',
u'competing',
u'with',
u'Abc',
u"'s",
u'Popol',
u'-LRB-',
u'market',
u'leader',
u'-RRB-',
u'and',
u'Xyz',
u"'s",
u'Abcvd',
u'.']
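To see why qry.split() hurt the tagger, here is a rough pure-Python approximation of the kind of splitting shown above. It is only a sketch for illustration and is not the Stanford tokenizer: the helper name `rough_ptb_split` and the regular expression are assumptions of mine, and they only handle the cases in this example (trailing punctuation, the possessive 's, and parentheses).

```python
import re

# Hypothetical helper, NOT the real Stanford PTBTokenizer: it only covers
# the possessive 's, runs of word characters, and single punctuation marks.
_TOKEN_RE = re.compile(r"'s\b|\w+|[^\w\s]")

# The Stanford tokenizer output above writes parentheses as -LRB- / -RRB-.
_BRACKETS = {"(": "-LRB-", ")": "-RRB-"}


def rough_ptb_split(text):
    """Split text so punctuation and 's become separate tokens."""
    tokens = _TOKEN_RE.findall(text)
    return [_BRACKETS.get(tok, tok) for tok in tokens]


qry = ("In the UK, the class is relatively crowded with Zacc competing "
       "with Abc's Popol (market leader) and Xyz's Abcvd.")

# Whitespace splitting glues punctuation onto the words ...
print(qry.split()[2])
# ... while the rough tokenizer separates it.
print(rough_ptb_split(qry)[2:4])
```

With qry.split(), the tagger sees 'UK,' and "Abc's" as single words, whereas the tokenized form gives it 'UK' and 'Abc' on their own. For real input, the Stanford tokenizer is the safer choice, since it is the tokenization shown in the answer above.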