I am searching for a way to use the Stanford word tokenizer in NLTK. I want to use it because when I compare the results of the Stanford and NLTK word tokenizers, they differ. I know there should be a way to use the Stanford tokenizer, just as we can use the Stanford POS Tagger and NER in NLTK.
Is it possible to use the Stanford tokenizer without running a server?
Thanks
Outside of NLTK, you can use the official Python interface recently released by Stanford NLP:
cd ~
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
pip3 install -U https://github.com/stanfordnlp/python-stanford-corenlp/archive/master.zip
# On Mac
export CORENLP_HOME=/Users/<username>/stanford-corenlp-full-2016-10-31/
# On Linux
export CORENLP_HOME=/home/<username>/stanford-corenlp-full-2016-10-31/
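The client locates the CoreNLP distribution through the `CORENLP_HOME` environment variable, so it's worth confirming the variable is actually visible to Python before starting a session. A minimal stdlib check (the path printed depends on where you unzipped CoreNLP in the steps above):

```python
import os

# The corenlp client finds the CoreNLP jars via CORENLP_HOME;
# if it is unset, client startup will fail, so check it up front.
corenlp_home = os.environ.get("CORENLP_HOME")
if corenlp_home is None:
    print("CORENLP_HOME is not set -- export it before starting the client")
else:
    print("Using CoreNLP distribution at", corenlp_home)
```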
>>> import corenlp
>>> text = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP."
>>> with corenlp.client.CoreNLPClient(annotators="tokenize ssplit".split()) as client:
...     ann = client.annotate(text)
...
[pool-1-thread-4] INFO CoreNLP - [/0:0:0:0:0:0:0:1:55475] API call w/annotators tokenize,ssplit
Chris wrote a simple sentence that he parsed with Stanford CoreNLP.
>>> sentence = ann.sentence[0]
>>>
>>> [token.word for token in sentence.token]
['Chris', 'wrote', 'a', 'simple', 'sentence', 'that', 'he', 'parsed', 'with', 'Stanford', 'CoreNLP', '.']
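To illustrate the kind of difference the question mentions, here is a minimal stdlib sketch (no server needed) contrasting a naive whitespace split with the CoreNLP tokens from the session above: a real tokenizer splits trailing punctuation into its own token, which `str.split()` does not.

```python
# The sentence from the session above, plus the token list CoreNLP produced.
text = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP."
corenlp_tokens = ['Chris', 'wrote', 'a', 'simple', 'sentence', 'that',
                  'he', 'parsed', 'with', 'Stanford', 'CoreNLP', '.']

# A plain whitespace split leaves the final period glued to the last word.
naive_tokens = text.split()
print(naive_tokens[-1])     # CoreNLP.
print(corenlp_tokens[-2:])  # ['CoreNLP', '.']
```

This is also why results from NLTK's default `word_tokenize` and Stanford's tokenizer can disagree: each applies its own rules for punctuation, contractions, and similar edge cases.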