I'm doing Natural Language Processing (NLP) in Python 3 and more specifically Named Entity Recognition (NER) on the Harry Potter set of books. I'm using StanfordNER, which works pretty well but takes incredible amounts of time...
I have done some research online on why it would be this slow, but I can't seem to find anything that truly applies to my code, and I honestly think the problem lies more in the (bad) way I have written it.
So here's what I have written so far:
import string
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk.tag.stanford as st
tagger = st.StanfordNERTagger('_path_/stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz', '_path_/stanford-ner-2017-06-09/stanford-ner.jar')
#this is just to read the file
hp = open("books/hp1.txt", 'r', encoding='utf8')
lhp = hp.readlines()
#a small function I wrote to divide the book in sentences
def get_sentences(lbook):
    sentences = []
    for k in lbook:
        j = sent_tokenize(k)
        for i in j:
            if bool(i):
                sentences.append(i)
    return sentences
#a function to divide a sentence into words
def get_words(sentence):
    words = word_tokenize(sentence)
    return words
sentences = get_sentences(lhp)
#and now the code I wrote to get all the words labeled as PERSON by the StanfordNER tagger
characters = []
for i in range(len(sentences)):
    characters += [tag[0] for tag in tagger.tag(get_words(sentences[i])) if tag[1] == "PERSON"]
print(characters)
Now the problem, as I explained, is that the code takes a huge amount of time... So I'm wondering: is that normal, or can I save time by rewriting the code in a better way? If so, could you help me out?
The bottleneck is the tagger.tag method: each call carries a large fixed overhead (NLTK launches the Stanford Java process for every call), so calling it once per sentence results in a really slow program. Unless there's an additional need for splitting the book into sentences, I'd process the whole text at once:
with open('books/hp1.txt', 'r', encoding='utf8') as content_file:
    all_text = content_file.read()
tags = tagger.tag(word_tokenize(all_text))
characters = [tag[0] for tag in tags if tag[1] == "PERSON"]
print(characters)
Now, if what you want to know is, say, which sentences each character is mentioned in, you could first collect the characters' names in characters as in the code above, and then loop through the sentences, checking whether an element of characters appears in each one.
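A minimal sketch of that lookup, assuming the sentences list and get_words helper from your code and the characters list produced above (the dictionary name and the 'Harry' lookup are just illustrative):

# Hypothetical follow-up: map each detected name to the sentences mentioning it.
# Assumes `sentences`, `get_words` and `characters` already exist as above.
character_sentences = {name: [] for name in set(characters)}
for sentence in sentences:
    sentence_words = set(get_words(sentence))
    for name in character_sentences:
        if name in sentence_words:
            character_sentences[name].append(sentence)
print(character_sentences.get('Harry', [])[:3])  # first few sentences mentioning 'Harry'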
If file size is a concern (although the .txt file of most books shouldn't be a problem to load into memory), then instead of tagging the whole book in one go, you could tag a number n of sentences at a time. Starting from your code, modify the for loop like so:
n = 1000
characters = []
for i in range(0, len(sentences), n):
    # join n sentences and tag them with a single call
    chunk = ' '.join(sentences[i:i + n])
    characters += [tag[0] for tag in tagger.tag(get_words(chunk)) if tag[1] == "PERSON"]
The general idea is to minimize the number of calls to tagger.tag, because of its big overhead.
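If you do want to keep the results grouped per sentence, NLTK's StanfordNERTagger also exposes a tag_sents method that takes a list of already-tokenized sentences and tags them in one pass. A minimal sketch, again assuming the sentences list and get_words from your code:

# One call to the Stanford backend, but the results stay grouped per sentence.
# Assumes `sentences`, `get_words` and `tagger` from the question's code.
tokenized = [get_words(s) for s in sentences]
tagged_sentences = tagger.tag_sents(tokenized)
characters_per_sentence = [
    [tag[0] for tag in tagged if tag[1] == "PERSON"]
    for tagged in tagged_sentences
]
print(characters_per_sentence[:5])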