
Named Entity Recognition using NLTK: Extract Auditor name, address and organisation


I am trying to use NLTK to identify the person, organization and place mentioned in a sentence.

My use case is to extract the auditor's name, organization and place from an annual financial report.

With NLTK in Python, the results are not satisfactory:

import nltk
from nltk.chunk import ne_chunk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

ex='Alastair John Richard Nuttall (Senior statutory auditor) for and on behalf of Ernst & Young LLP (Statutory auditor) Leeds'

ne_tree = ne_chunk(pos_tag(word_tokenize(ex)))

print(ne_tree)

Tree('S', [Tree('PERSON', [('Alastair', 'NNP')]), Tree('PERSON', [('John', 'NNP'), ('Richard', 'NNP'), ('Nuttall', 'NNP')]), ('(', '('), Tree('ORGANIZATION', [('Senior', 'NNP')]), ('statutory', 'NNP'), ('auditor', 'NN'), (')', ')'), ('for', 'IN'), ('and', 'CC'), ('on', 'IN'), ('behalf', 'NN'), ('of', 'IN'), Tree('GPE', [('Ernst', 'NNP')]), ('&', 'CC'), Tree('PERSON', [('Young', 'NNP'), ('LLP', 'NNP')]), ('(', '('), ('Statutory', 'NNP'), ('auditor', 'NN'), (')', ')'), ('Leeds', 'NNS')])

As seen above, 'Leeds' is not identified as a place, nor is 'Ernst & Young LLP' recognized as an organization: it is split into a GPE ('Ernst') and a PERSON ('Young LLP').

Are there any better ways of achieving this in Python?


Solution

  • Try spaCy instead of NLTK:

    https://spacy.io/usage/linguistic-features#named-entities

    I think spaCy's pretrained models are likely to perform better on this kind of text. The results (with spaCy 2.1, en_core_web_lg) for your sentence are:

    Alastair John Richard Nuttall PERSON
    Ernst & Young LLP ORG
    Leeds GPE
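
A minimal sketch of how results like these could be produced and grouped by entity type, assuming spaCy is installed and the en_core_web_lg model has been downloaded (python -m spacy download en_core_web_lg). The helper names group_entities and extract_auditor_details are made up for illustration and are not part of spaCy; the label set PERSON / ORG / GPE is an assumption based on the output above.

```python
from collections import defaultdict


def group_entities(doc, labels=("PERSON", "ORG", "GPE")):
    """Group entity texts by label.

    `doc` is expected to behave like a spaCy Doc: it exposes `.ents`,
    and each entity has `.text` and `.label_`.
    """
    grouped = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ in labels:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)


def extract_auditor_details(text, model="en_core_web_lg"):
    """Run a pretrained spaCy pipeline over `text` and group its entities.

    Hypothetical helper: the model name is an assumption taken from the
    answer above; any installed English pipeline with an NER component
    could be passed instead.
    """
    import spacy  # imported here so group_entities can be used without spaCy

    nlp = spacy.load(model)
    return group_entities(nlp(text))
```

Calling extract_auditor_details(ex) on the sentence from the question should give a dict keyed by PERSON, ORG and GPE. The exact entities depend on the spaCy and model versions, so it is worth checking the output on a few real auditor statements before relying on it.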