Search code examples
pythonnlp

How to recognize if string is human name?


So I have some text data that's been messily parsed, and due to that I get names mixed in with the actual data. Is there any kind of package/library that helps identify whether a word is a name or not? (In this case, I would be assuming US/western/euro-centric names)

Otherwise, what would be a good way to flag this? Maybe train a model on a corpus of names and assign each word in the dataset a classification? Just not sure the best way to approach this problem/what kind of model would be suited, or if a solution already exists


Solution

  • import nltk
    from nltk.tag.stanford import NERTagger
    st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
    text = """YOUR TEXT GOES HERE"""
    
    for sent in nltk.sent_tokenize(text):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)
        for tag in tags:
            if tag[1]=='PERSON': print tag
    

    via Improving the extraction of human names with nltk