Tags: python, nlp, nltk, stanford-nlp, named-entity-recognition

How to speed up NE recognition with Stanford NER in Python NLTK


First I tokenize the file content into sentences and then call Stanford NER on each sentence, but this process is really slow. I know it would be faster if I called the tagger on the whole file content at once, but I'm calling it on each sentence because I want to index each sentence before and after NE recognition.

from nltk import sent_tokenize, word_tokenize
from nltk.tag import StanfordNERTagger

st = StanfordNERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')
for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    sentences = sent_tokenize(filecontent)  # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent)  # tokenize sentence into words
        ne_tags = st.tag(words)      # get tagged NEs from Stanford NER

The slowness is probably due to calling st.tag() once per sentence, but is there any way to make it run faster?

EDIT

The reason I want to tag sentences separately is that I want to write the sentences to a file (like a sentence index), so that given an NE-tagged sentence at a later stage, I can get back the unprocessed sentence (I'm also lemmatizing here).

file format:

(sent_number, orig_sentence, NE_and_lemmatized_sentence)
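
For example, one line in such a file might look like this (the sentence, lemmas, and tags below are made up purely for illustration):

    (0, "Barack Obama was born in Hawaii.", "Barack/PERSON Obama/PERSON be/O bear/O in/O Hawaii/LOCATION")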


Solution

  • StanfordNERTagger provides a tag_sents() function, see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68

    >>> st = StanfordNERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
    ...                        'stanford-ner/stanford-ner.jar')
    >>> for filename in filelist:
    ...     filecontent = open(filename).read()
    ...     tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]
    ...     tagged_sents = st.tag_sents(tokenized_sents)  # one tagger call per file
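
To keep the (sent_number, orig_sentence, NE_and_lemmatized_sentence) format from the question, note that tag_sents() returns the tagged sentences in the same order as its input, so the index can be rebuilt after the batch call. Below is a minimal sketch of that idea; it uses NLTK's WordNetLemmatizer as a stand-in for whatever lemmatizer you are using, takes filelist from the question, and sentence_index.tsv is a made-up output name:

    from nltk import sent_tokenize, word_tokenize
    from nltk.stem import WordNetLemmatizer
    from nltk.tag import StanfordNERTagger

    st = StanfordNERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                           'stanford-ner/stanford-ner.jar')
    lemmatizer = WordNetLemmatizer()

    with open('sentence_index.tsv', 'w') as out:
        for filename in filelist:
            with open(filename) as f:
                filecontent = f.read()
            sentences = sent_tokenize(filecontent)
            # one tagger (JVM) call per file instead of one per sentence
            tagged_sents = st.tag_sents([word_tokenize(s) for s in sentences])
            # tag_sents() preserves input order, so output sentence j still
            # corresponds to sentences[j]
            for j, (orig, tagged) in enumerate(zip(sentences, tagged_sents)):
                ne_and_lemma = ' '.join('{}/{}'.format(lemmatizer.lemmatize(w.lower()), tag)
                                        for w, tag in tagged)
                out.write('{}\t{}\t{}\n'.format(j, orig, ne_and_lemma))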