I'm doing Natural Language Processing (NLP) in Python 3 and more specifically Named Entity Recognition (NER) on the Harry Potter set of books. I'm using StanfordNER, which works pretty well but takes incredible amounts of time...
I have done some research online on why it would be this slow, but I can't seem to find anything that truly applies to my code, and I honestly think the problem lies more in the (bad) way I have written it.
So here's what I have written so far:
import string
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk.tag.stanford as st
tagger = st.StanfordNERTagger('_path_/stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz', '_path_/stanford-ner-2017-06-09/stanford-ner.jar')
#this is just to read the file
hp = open("books/hp1.txt", 'r', encoding='utf8')
lhp = hp.readlines()
#a small function I wrote to divide the book in sentences
def get_sentences(lbook):
    sentences = []
    for k in lbook:
        j = sent_tokenize(k)
        for i in j:
            if bool(i):
                sentences.append(i)
    return sentences
#a function to divide a sentence into words
def get_words(sentence):
    words = word_tokenize(sentence)
    return words
sentences = get_sentences(lhp)
#and now the code I wrote to get all the words labeled as PERSON by the StanfordNER tagger
characters = []
for i in range(len(sentences)):
    characters += [tag[0] for tag in tagger.tag(get_words(sentences[i])) if tag[1] == "PERSON"]
print(characters)
Now the problem, as I explained, is that the code takes a huge amount of time... So I'm wondering: is that normal, or can I save time by rewriting the code in a better way? If so, could you help me out?
The bottleneck is the tagger.tag method: each call carries a large fixed overhead (NLTK launches the Stanford Java process for every call), so calling it once per sentence results in a really slow program. Unless there's an additional need for splitting the book into sentences, I'd process the whole text at once:
with open('books/hp1.txt', 'r', encoding='utf8') as content_file:
    all_text = content_file.read()
tags = tagger.tag(word_tokenize(all_text))
characters = [tag[0] for tag in tags if tag[1] == "PERSON"]
print(characters)
Now, if what you want to know is, say, which sentences each character is mentioned in, you could first collect the characters' names in characters as in the code above, and then loop through the sentences, checking whether an element of characters appears in each one.
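A minimal sketch of that lookup, assuming the sentences list and get_words helper from your code and the characters list produced above (the dictionary name and the 'Harry' lookup are just illustrative):

# Hypothetical follow-up: map each detected name to the sentences mentioning it.
# Assumes `sentences`, `get_words` and `characters` already exist as above.
character_sentences = {name: [] for name in set(characters)}
for sentence in sentences:
    sentence_words = set(get_words(sentence))
    for name in character_sentences:
        if name in sentence_words:
            character_sentences[name].append(sentence)
print(character_sentences.get('Harry', [])[:3])  # first few sentences mentioning 'Harry'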
If file size is a concern (although the .txt file of most books shouldn't be a problem to load into memory), then instead of tagging the whole book in one go, you could tag a number n of sentences at a time. Starting from your code, modify the for loop like so:
n = 1000
characters = []
for i in range(0, len(sentences), n):
    # join n sentences and tag them with a single call
    chunk = ' '.join(sentences[i:i + n])
    characters += [tag[0] for tag in tagger.tag(get_words(chunk)) if tag[1] == "PERSON"]
The general idea is to minimize the number of calls to tagger.tag, because of its big overhead.
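If you do want to keep the results grouped per sentence, NLTK's StanfordNERTagger also exposes a tag_sents method that takes a list of already-tokenized sentences and tags them in one pass. A minimal sketch, again assuming the sentences list and get_words from your code:

# One call to the Stanford backend, but the results stay grouped per sentence.
# Assumes `sentences`, `get_words` and `tagger` from the question's code.
tokenized = [get_words(s) for s in sentences]
tagged_sentences = tagger.tag_sents(tokenized)
characters_per_sentence = [
    [tag[0] for tag in tagged if tag[1] == "PERSON"]
    for tagged in tagged_sentences
]
print(characters_per_sentence[:5])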