Stanford NER provides jars for detecting POS tags and named entities, but I am running into a problem when I tag one of my sentences. The sentence is as follows:
Joseph E. Seagram & Sons, INC said on Thursday that it is merging its two United States based wine companies
Below is my code:
from nltk import Tree
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# stanfordNE2tree is a helper defined elsewhere (not shown here); it groups
# the tagged tokens into an nltk Tree with one subtree per entity.

st = StanfordNERTagger('./stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       './stanford-ner/stanford-ner.jar',
                       encoding='utf-8')
ne_in_sent = []
with open("./CCAT/2551newsML.txt") as fd:
    lines = fd.readlines()
for line in lines:
    print(line)
    tokenized_text = word_tokenize(line)
    classified_text = st.tag(tokenized_text)
    ne_tree = stanfordNE2tree(classified_text)
    for subtree in ne_tree:
        # If subtree is a noun chunk, i.e. NE != "O"
        if isinstance(subtree, Tree):
            ne_label = subtree.label()
            ne_string = " ".join([token for token, pos in subtree.leaves()])
            ne_in_sent.append((ne_string, ne_label))
print(ne_in_sent)
When I run this, the company name is split into two separate organization entities: (Joseph E. Seagram & Sons, Organization) and (Inc, Organization).
The same thing happens with other text in the file, for example:
TransCo has a very big plane. Transco is moving south.
Here the two mentions are treated as different organizations because of the capitalization, so I get 2 entities: (TransCo, organization) and (Transco, organization).
Is it possible to merge these into one entity?
So far I have tried a cosine similarity check to compare the entity strings.
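To make it clearer what kind of check I mean, here is a minimal sketch of a cosine similarity over character bigrams of the lower-cased entity strings. The function names `char_ngrams` and `cosine_similarity` and the choice of bigrams are only assumptions for illustration; they are not part of NLTK or Stanford NER.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Count the character n-grams of the lower-cased string (hypothetical helper)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b, n=2):
    """Cosine similarity between the character n-gram counts of two strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # Dot product over the n-grams that both strings share.
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    # Product of the two vector lengths, computed under a single square root.
    norm = sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# Names that differ only in capitalization have identical n-gram counts.
print(cosine_similarity("TransCo", "Transco"))
# Unrelated names share few n-grams, so the score is much lower.
print(cosine_similarity("TransCo", "Joseph E. Seagram & Sons"))
```

With a check like this, two mentions could be treated as the same entity when their score is above some threshold, but I am not sure what threshold is reasonable or whether this is the right approach at all.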