Tags: python, nlp, stanford-nlp, named-entity-recognition

Error with NLTK package and other dependencies


I have installed the NLTK package and other dependencies and set the environment variables as follows:

STANFORD_MODELS=/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz:/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz:/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.conll.4class.distsim.crf.ser.gz

CLASSPATH=/mnt/d/stanford-ner/stanford-ner-2018-10-16/stanford-ner.jar

When I try to access the classifier as shown below:

import os
from nltk.tag import StanfordNERTagger

stanford_classifier = os.environ.get('STANFORD_MODELS').split(':')[0]
stanford_ner_path = os.environ.get('CLASSPATH').split(':')[0]

st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')

I get the following error, but I don't understand what is causing it:

Error: Could not find or load main class edu.stanford.nlp.ie.crf.CRFClassifier
OSError: Java command failed : ['/mnt/c/Program Files (x86)/Common Files/Oracle/Java/javapath_target_1133041234/java.exe', '-mx1000m', '-cp', '/mnt/d/stanford-ner/stanford-ner-2018-10-16/stanford-ner.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-loadClassifier', '/mnt/d/stanford-ner/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz', '-textFile', '/tmp/tmpaiqclf_d', '-outputFormat', 'slashTags', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerOptions', '"tokenizeNLs=false"', '-encoding', 'utf8']

Solution

  • I found the answer to this issue. I am using NLTK == 3.4. From NLTK 3.3 onwards, the Stanford NLP tools (POS tagger, NER, tokenizer) are no longer loaded through nltk.tag but through nltk.parse.corenlp.CoreNLPParser. The Stack Overflow answer is available at stackoverflow.com/questions/13883277/stanford-parser-and-nltk/… and the official documentation is at github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK; a minimal usage sketch follows after this list.

    Additionally, if you run into a timeout from the NER tagger or any other CoreNLP parser, increase the timeout limit as described in https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK/_compare/3d64e56bede5e6d93502360f2fcd286b633cbdb9...f33be8b06094dae21f1437a6cb634f86ad7d83f7 by dimazest.
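
The sketch below illustrates the CoreNLP-based workflow from the linked wiki page. It assumes Stanford CoreNLP has been downloaded and its server is running locally on port 9000; the server command in the comment is adapted from that wiki, and the port, memory setting, timeout value, and sample sentence are illustrative assumptions rather than fixed requirements.

# Start the CoreNLP server from the unzipped CoreNLP directory before running this script
# (command adapted from the NLTK wiki; -timeout is where the limit mentioned above is raised):
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
#       -preload tokenize,ssplit,pos,lemma,ner,parse,depparse \
#       -status_port 9000 -port 9000 -timeout 60000
from nltk.parse.corenlp import CoreNLPParser

# NER tagging goes through the running server instead of a local stanford-ner.jar
ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
print(list(ner_tagger.tag(tokens)))
# Output is roughly [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ...]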