stanford-nlp

How to speed up Stanford NLP in Python?


from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Load the 3-class CRF model (english.all.3class.distsim.crf.ser.gz) and the Stanford NER jar.
st = StanfordNERTagger('/media/sf_codebase/modules/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/media/sf_codebase/modules/stanford-ner-2018-10-16/stanford-ner.jar',
                       encoding='utf-8')

After the tagger is initialized with the code above, the following code takes about 10 seconds to tag a single short sentence, as shown below. How can I speed it up?

%%time
text="My name is John Doe"
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print(classified_text)

Output

[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 20 ms, total: 24 ms
Wall time: 10.9 s

Solution

  • Another solution within NLTK is to not use the old nltk.tag.StanfordNERTagger but instead to use the newer nltk.parse.CoreNLPParser . See, e.g., https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK .

    More generally, the key to good performance is to run a server on the Java side and call it repeatedly, so that no new subprocess has to be started for each sentence processed. The old tagger launches a fresh Java process and reloads the model on every tag() call, which is why the timing above shows about 10 s of wall time against only 24 ms of CPU time on the Python side. You can use either the NERServer if you just need NER or the StanfordCoreNLPServer for all CoreNLP functionality. There are a number of Python interfaces to it; see: https://stanfordnlp.github.io/CoreNLP/other-languages.html#python
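
To illustrate the server approach, here is a minimal sketch that talks to a running StanfordCoreNLPServer over HTTP using only the Python standard library. It rests on several assumptions that are not part of the question: the full CoreNLP distribution (a separate download from stanford-ner) is installed, the server is already listening on http://localhost:9000, and the helper names extract_ner_tags and tag_text are made up for this example. The tag set returned by the server's default NER models may also differ from the 3-class model used in the question.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the server was started once, separately, from the CoreNLP
# distribution directory, for example:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
SERVER_URL = "http://localhost:9000"


def extract_ner_tags(response):
    """Flatten a CoreNLP JSON response into (word, NER tag) pairs."""
    return [
        (token["word"], token["ner"])
        for sentence in response.get("sentences", [])
        for token in sentence.get("tokens", [])
    ]


def tag_text(text, server_url=SERVER_URL, timeout=30):
    """Send text to the already-running server and return (word, NER tag) pairs."""
    properties = {
        "annotators": "tokenize,ssplit,pos,lemma,ner",
        "outputFormat": "json",
    }
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    request = urllib.request.Request(
        f"{server_url}/?{query}",
        data=text.encode("utf-8"),  # supplying a body makes this a POST
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_ner_tags(json.load(reply))
```

With the server up, tag_text("My name is John Doe") can be called as often as needed; the JVM and models stay loaded between calls, so only the first request pays the startup cost. Within NLTK, the wiki page linked above shows the equivalent call as CoreNLPParser(url='http://localhost:9000', tagtype='ner').tag(tokens).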