I am trying to extract named entities from text using Stanford-NER. I have read all related threads regarding chunking and did not find anything to solve the problem I am having.
Input:
The united nations is holding a meeting in the united states of America.
Expected Output:
united nations/organization
united states of America/location
I was able to get this output, but it doesn't combine tokens for multi-word named entities:
[('The', 'O'), ('united', 'ORGANIZATION'), ('nations', 'ORGANIZATION'), ('is', 'O'), ('holding', 'O'), ('a', 'O'), ('meeting', 'O'), ('in', 'O'), ('the', 'O'), ('united', 'LOCATION'), ('states', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'), ('.', 'O')]
or in a tree format:
(S
The/O
united/ORGANIZATION
nations/ORGANIZATION
is/O
holding/O
a/O
meeting/O
in/O
the/O
united/LOCATION
states/LOCATION
of/LOCATION
America/LOCATION
./O)
I am looking for this output:
[('The', 'O'), ('united nations', 'ORGANIZATION'), ('is', 'O'), ('holding', 'O'), ('a', 'O'), ('meeting', 'O'), ('in', 'O'), ('the', 'O'), ('united states of America', 'LOCATION'), ('.', 'O')]
When I tried some of the code I found in other threads to join named entities in the tree format, it returned an empty list.
import nltk
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
import os
java_path = r"C:\Program Files (x86)\Java\jre1.8.0_251\java.exe"
os.environ['JAVAHOME'] = java_path
st = StanfordNERTagger(r'stanford-ner-4.0.0/stanford-ner-4.0.0/classifiers/english.all.3class.distsim.crf.ser.gz',
r'stanford-ner-4.0.0/stanford-ner-4.0.0/stanford-ner.jar',
encoding='utf-8')
text = 'The united nations is holding a meeting in the united states of America.'
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
namedEnt = nltk.ne_chunk(classified_text, binary=True)
#this line makes the tree return an empty list
np = [' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.label() == "NE"]
print(np)
print(classified_text)
The StanfordNERTagger in nltk doesn't retain information about entity boundaries: it returns a flat list of (token, tag) pairs, so there is no way to tell from the output alone whether two consecutive tokens with the same tag belong to a single entity or to two distinct ones. As an aside, nltk.ne_chunk expects POS-tagged pairs such as ('word', 'NN'), not NER-tagged ones, which is why your tree code finds no "NE" subtrees and returns an empty list.
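If merging adjacent tokens that share the same non-'O' tag is acceptable for your data (note this will wrongly fuse two distinct entities of the same type that happen to be adjacent), a minimal sketch using itertools.groupby over the tagger's output:

```python
from itertools import groupby

def merge_adjacent(tagged):
    """Merge runs of consecutive tokens that share the same non-'O' tag.

    Caveat: two distinct, same-typed entities sitting next to each other
    are indistinguishable in the flat tagger output and will be fused.
    """
    merged = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        tokens = [tok for tok, _ in group]
        if tag == 'O':
            # Keep untagged tokens as individual entries.
            merged.extend((tok, tag) for tok in tokens)
        else:
            # Join the run of same-tagged tokens into one entity string.
            merged.append((' '.join(tokens), tag))
    return merged

# The tagger output from the question:
tagged = [('The', 'O'), ('united', 'ORGANIZATION'), ('nations', 'ORGANIZATION'),
          ('is', 'O'), ('holding', 'O'), ('a', 'O'), ('meeting', 'O'),
          ('in', 'O'), ('the', 'O'), ('united', 'LOCATION'),
          ('states', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'),
          ('.', 'O')]
print(merge_adjacent(tagged))
```

On the question's example this produces exactly the desired output, with 'united nations' and 'united states of America' as single entries.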
Alternatively, https://stanfordnlp.github.io/CoreNLP/other-languages.html#python notes that the Stanford team is actively developing a Python package called Stanza, which can use Stanford CoreNLP under the hood. It is slow, but really easy to use.
$ pip3 install stanza
>>> import stanza
>>> stanza.download('en')
>>> nlp = stanza.Pipeline('en')
>>> results = nlp(<insert your text string here>)
The chunked entities are in results.ents.
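Each entry in results.ents is a stanza Span object; per the Stanza documentation, the surface string and entity label are exposed as .text and .type (worth double-checking against the version you install):
>>> for ent in results.ents:
...     print(ent.text, ent.type)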