Tags: python, nltk, stanford-nlp, tokenize

How to use the Stanford word tokenizer in NLTK?


I am looking for a way to use the Stanford word tokenizer in NLTK. I want to use it because when I compare the results of the Stanford and NLTK word tokenizers, they are different. I know there should be a way to use the Stanford tokenizer, just as we can use the Stanford POS Tagger and NER in NLTK.

Is it possible to use the Stanford tokenizer without running a server?

Thanks


Solution

  • Outside of NLTK, you can use the official Python interface recently released by Stanford NLP:

    Install

    cd ~
    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
    unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
    pip3 install -U https://github.com/stanfordnlp/python-stanford-corenlp/archive/master.zip
    

    Setup Environment

    # On Mac
    export CORENLP_HOME=/Users/<username>/stanford-corenlp-full-2016-10-31/
    
    # On linux
    export CORENLP_HOME=/home/<username>/stanford-corenlp-full-2016-10-31/
    
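    If you prefer, you can also point the client at your CoreNLP folder from inside Python instead of exporting the variable in your shell. This is a minimal sketch, assuming the client reads CORENLP_HOME from the environment when it starts and that you unzipped the archive into your home directory (adjust the path to your setup):

    >>> import os
    >>> # Example path; change it to wherever stanford-corenlp-full-2016-10-31 was unzipped
    >>> os.environ["CORENLP_HOME"] = os.path.expanduser("~/stanford-corenlp-full-2016-10-31")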

    In Python

    >>> import corenlp
    >>> # The text to annotate (it was left undefined in the original snippet)
    >>> text = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP."
    >>> with corenlp.client.CoreNLPClient(annotators="tokenize ssplit".split()) as client:
    ...     ann = client.annotate(text)
    ... 
    [pool-1-thread-4] INFO CoreNLP - [/0:0:0:0:0:0:0:1:55475] API call w/annotators tokenize,ssplit
    Chris wrote a simple sentence that he parsed with Stanford CoreNLP.
    >>> sentence = ann.sentence[0]
    >>> 
    >>> [token.word for token in sentence.token]
    ['Chris', 'wrote', 'a', 'simple', 'sentence', 'that', 'he', 'parsed', 'with', 'Stanford', 'CoreNLP', '.']
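    The snippet above only looks at the first sentence. Since ann.sentence is a list of sentence messages, a list comprehension over it gives you the tokens for every sentence of a longer text; this is a small sketch built on the same fields used above:

    >>> [[token.word for token in sentence.token] for sentence in ann.sentence]
    [['Chris', 'wrote', 'a', 'simple', 'sentence', 'that', 'he', 'parsed', 'with', 'Stanford', 'CoreNLP', '.']]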