Tags: python, nlp, nltk, stanford-nlp, named-entity-recognition

Chunking Stanford Named Entity Recognizer (NER) outputs from NLTK format


I am using NER in NLTK to find persons, locations, and organizations in sentences. I am able to produce the results like this:

[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]

Is it possible to chunk adjacent tokens that share the same tag? What I want is something like this:

u'Remaking'/ u'O', u'The'/u'O', (u'Republican', u'Party')/u'ORGANIZATION'

Thanks!


Solution

  • You can use the standard NLTK way of representing chunks using nltk.Tree. This might mean that you have to change your representation a bit.

    What I usually do is represent NER-tagged sentences as lists of triplets:

    sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
    

    I do this when I use an external tool to NER-tag a sentence. Now you can transform this sentence into the NLTK representation:

    from nltk import Tree
    
    
    def IOB_to_tree(iob_tagged):
        """Build a Tree from (word, POS, NER) triplets, grouping adjacent tokens with the same NER tag."""
        root = Tree('S', [])
        for word, pos, ner in iob_tagged:
            if ner == 'O':
                # Tokens outside any entity stay as plain (word, POS) leaves.
                root.append((word, pos))
            elif root and isinstance(root[-1], Tree) and root[-1].label() == ner:
                # Same entity type as the previous chunk: extend that chunk.
                root[-1].append((word, pos))
            else:
                # Otherwise start a new chunk labelled with the entity type.
                root.append(Tree(ner, [(word, pos)]))
    
        return root
    
    
    sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
    print(IOB_to_tree(sentence))
    

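    The change in representation makes sense because POS tags are usually available when doing NER tagging (many NER pipelines use them as features), and NLTK's chunk trees store (word, POS) pairs as leaves.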

    The end result should look like:

    (S
      (PERSON Andrew/NNP)
      is/VBZ
      part/NN
      of/IN
      the/DT
      (ORGANIZATION Republican/NNP Party/NNP)
      in/IN
      (LOCATION Dallas/NNP))
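
If you would rather keep the original list of (word, tag) pairs from the question and skip the POS tags, the same grouping idea can be sketched with only the standard library. This is a minimal sketch, not part of NLTK: `chunk_tagged` is a hypothetical helper name, and it assumes that `'O'` marks tokens outside any entity, as in the tagger output shown in the question.

```python
from itertools import groupby


def chunk_tagged(tagged, outside='O'):
    """Merge runs of adjacent (word, tag) pairs that share a non-'O' tag.

    Hypothetical helper for illustration; not an NLTK function.
    """
    chunks = []
    # groupby yields one group per run of consecutive items with the same tag.
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag == outside:
            # Tokens outside any entity stay as individual (word, tag) pairs.
            chunks.extend((word, tag) for word in words)
        else:
            # An entity run becomes a single ((word, ...), tag) entry.
            chunks.append((tuple(words), tag))
    return chunks


tagged = [('Remaking', 'O'), ('The', 'O'),
          ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]
chunked = chunk_tagged(tagged)
print(chunked)
# [('Remaking', 'O'), ('The', 'O'), (('Republican', 'Party'), 'ORGANIZATION')]
```

Like the tree-based version, this merges any adjacent tokens with the same tag, so two distinct entities of the same type that sit next to each other end up in one chunk. Plain tags without B-/I- prefixes do not carry enough information to separate them.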