So I have some text data that's been messily parsed, and due to that I get names mixed in with the actual data. Is there any kind of package/library that helps identify whether a word is a name or not? (In this case, I would be assuming US/western/euro-centric names)
Otherwise, what would be a good way to flag this? Maybe train a model on a corpus of names and assign each word in the dataset a classification? Just not sure the best way to approach this problem/what kind of model would be suited, or if a solution already exists
import nltk
from nltk.tag.stanford import NERTagger
st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""
for sent in nltk.sent_tokenize(text):
tokens = nltk.tokenize.word_tokenize(sent)
tags = st.tag(tokens)
for tag in tags:
if tag[1]=='PERSON': print(tag)