I am trying to use the StanfordNERTagger and nltk to extract keywords from a piece of text.
docText="John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics."
words = re.split("\W+",docText)
stops = set(stopwords.words("english"))
#remove stop words from the list
words = [w for w in words if w not in stops and len(w) > 2]
str = " ".join(words)
print str
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
stanfordPosTagList=[word for word,pos in stp.tag(str.split()) if pos == 'NNP']
print "Stanford POS Tagged"
print stanfordPosTagList
tagged = stn.tag(stanfordPosTagList)
print tagged
this gives me
John Donk works POI Brian Jones wants meet Xyz Corp measuring POI Short Term performance Metrics
Stanford POS Tagged
[u'John', u'Donk', u'POI', u'Brian', u'Jones', u'Xyz', u'Corp', u'POI', u'Short', u'Term']
[(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Brian', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION'), (u'POI', u'O'), (u'Short', u'O'), (u'Term', u'O')]
so clearly, things like Short
and Term
were tagged as NNP
. The data that i have contains many such instances where non NNP
words are capitalized. This might be due to typos or maybe they are headers. I dont have much control over that.
How can i parse or clean up the data so that i can detect a non NNP
term even though it may be capitalized? I dont want terms like Short
and Term
to be categorized as NNP
Also, not sure why John Donk
was captured as a person but Brian Jones
was not. Could it be due to the other capitalized non NNP
s in my data? Could that be having an effect on how the StanfordNERTagger
treats everything else?
Update, one possible solution
Here is what i plan to do
NNP
then we know that the original word must also be an NNP
Here is what i tried to do
str = " ".join(words)
print str
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
for word in str.split():
wl = word.lower()
print wl
w,pos = stp.tag(wl)
print pos
if pos=="NNP":
print "Got NNP"
print w
but this gives me error
John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics
john
Traceback (most recent call last):
File "X:\crp.py", line 37, in <module>
w,pos = stp.tag(wl)
ValueError: too many values to unpack
i have tried multiple approaches but some error always shows up. How can i Tag a single word?
I dont want to convert the whole string to lower case and then Tag. If i do that, the StanfordPOSTagger
returns an empty string
Firstly, see your other question to setup Stanford CoreNLP to be called from command-line or python: nltk : How to prevent stemming of proper nouns.
For the proper cased sentence we see that the NER works properly:
>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
... 'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner', 'outputFormat': 'json'})
>>> annotated_sent0 = output['sentences'][0]
>>> annotated_sent1 = output['sentences'][1]
>>> for token in annotated_sent0['tokens']:
... print token['word'], token['lemma'], token['pos'], token['ner']
...
John John NNP PERSON
Donk Donk NNP PERSON
works work VBZ O
POI POI NNP ORGANIZATION
Jones Jones NNP ORGANIZATION
wants want VBZ O
meet meet VB O
Xyz Xyz NNP ORGANIZATION
Corp Corp NNP ORGANIZATION
measuring measure VBG O
POI poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
. . . O
And for the lowered cased sentence, you will not get NNP
for POS tag nor any NER tag:
>>> for token in annotated_sent1['tokens']:
... print token['word'], token['lemma'], token['pos'], token['ner']
...
john john NN O
donk donk JJ O
works work NNS O
poi poi VBP O
jones jone NNS O
wants want VBZ O
meet meet VB O
xyz xyz NN O
corp corp NN O
measuring measure VBG O
poi poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
So the question to your question should be:
And after answering those questions, you can move on to decide what you really want to do with the NER tags, i.e.
If the input is lower-cased and it's because of how you structured your NLP tool chain, then
If the input is lower-cased because that's how the original data was, then:
If the input has erroneous casing, e.g. `Some big Some Small but not all are Proper Noun, then