I am using NER in NLTK to find persons, locations, and organizations in sentences. I am able to produce the results like this:
[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
Is it possible to chunk adjacent tokens that share the same tag into a single unit? What I want is something like this:
u'Remaking'/ u'O', u'The'/u'O', (u'Republican', u'Party')/u'ORGANIZATION'
Thanks!
You can use the standard NLTK way of representing chunks using nltk.Tree. This might mean that you have to change your representation a bit.
What I usually do is represent NER-tagged sentences as lists of triplets:
sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
I do this when I use an external tool to NER-tag a sentence. Now you can transform this sentence into the NLTK representation:
from nltk import Tree

def IOB_to_tree(iob_tagged):
    """Group consecutive tokens that share an entity tag into Tree chunks."""
    root = Tree('S', [])
    for word, pos, tag in iob_tagged:
        if tag == 'O':
            # Outside any entity: attach the (word, POS) pair directly to the root.
            root.append((word, pos))
        elif root and isinstance(root[-1], Tree) and root[-1].label() == tag:
            # Same tag as the chunk immediately before: extend that chunk.
            root[-1].append((word, pos))
        else:
            # Otherwise start a new chunk labelled with the entity tag.
            root.append(Tree(tag, [(word, pos)]))
    return root

sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
print(IOB_to_tree(sentence))
The change in representation makes sense because NER taggers typically rely on POS tags anyway, so you usually have them available.
The end result should look like:
(S
  (PERSON Andrew/NNP)
  is/VBZ
  part/NN
  of/IN
  the/DT
  (ORGANIZATION Republican/NNP Party/NNP)
  in/IN
  (LOCATION Dallas/NNP))
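
If you only have (word, tag) pairs without POS tags and just want the flat grouping asked for in the question, a plain-Python sketch with itertools.groupby may be enough. The function name chunk_by_tag is my own and not part of NLTK. Like the Tree version above, it assumes the tags carry no B-/I- prefixes, so two different entities of the same type that sit directly next to each other would be merged into one chunk.

```python
from itertools import groupby

def chunk_by_tag(tagged):
    """Merge runs of consecutive tokens that share a non-'O' tag.

    chunk_by_tag is a hypothetical helper for illustration, not an NLTK API.
    """
    chunks = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag == 'O':
            # Keep untagged tokens as individual (word, tag) pairs.
            chunks.extend((word, tag) for word in words)
        else:
            # Collapse the whole run into one (tuple_of_words, tag) entry.
            chunks.append((tuple(words), tag))
    return chunks

tagged = [('Remaking', 'O'), ('The', 'O'),
          ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]
print(chunk_by_tag(tagged))
# [('Remaking', 'O'), ('The', 'O'), (('Republican', 'Party'), 'ORGANIZATION')]
```

This keeps the input representation unchanged, at the cost of losing the tree structure that other NLTK tools expect.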