python, stanford-nlp, sentiment-analysis

Stanford NLP for Python


All I want to do is find the sentiment (positive/negative/neutral) of any given string. While researching, I came across Stanford NLP, but sadly it's in Java. Any ideas on how I can make it work with Python?


Solution

  • Use py-corenlp

    Download Stanford CoreNLP

    The latest version at this time (2020-05-25) is 4.0.0:

    wget https://nlp.stanford.edu/software/stanford-corenlp-4.0.0.zip https://nlp.stanford.edu/software/stanford-corenlp-4.0.0-models-english.jar
    

    If you do not have wget, you probably have curl:

    curl https://nlp.stanford.edu/software/stanford-corenlp-4.0.0.zip -O https://nlp.stanford.edu/software/stanford-corenlp-4.0.0-models-english.jar -O
    

    If all else fails, use the browser ;-)

    Install the package

    unzip stanford-corenlp-4.0.0.zip
    mv stanford-corenlp-4.0.0-models-english.jar stanford-corenlp-4.0.0
    

    Start the server

    cd stanford-corenlp-4.0.0
    java -mx5g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 10000
    

    Notes:

    1. timeout is in milliseconds; I set it to 10 seconds above. Increase it if you pass huge blobs to the server.
    2. There are more options; you can list them with --help.
    3. -mx5g should allocate enough memory, but YMMV and you may need to modify the option if your box is underpowered.
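
    If you would rather start the server from Python than from the shell, the command above can be assembled as an argument list. This is only a sketch: corenlp_server_command is a hypothetical helper (not part of pycorenlp or CoreNLP), and it uses only the -mx, -cp, -timeout and -port options mentioned in this answer.

```python
import shlex


def corenlp_server_command(memory_gb=5, timeout_ms=10000, port=9000):
    """Build the argv list for the CoreNLP server (hypothetical helper).

    Defaults mirror the shell command in this answer: 5 GB of heap,
    a 10-second timeout, and the default port 9000.
    """
    return [
        "java", f"-mx{memory_gb}g",
        "-cp", "*",  # passed literally; java expands the classpath wildcard itself
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-timeout", str(timeout_ms),
        "-port", str(port),
    ]


if __name__ == "__main__":
    # Print a shell-quoted version for copy/paste, with a longer timeout.
    print(shlex.join(corenlp_server_command(timeout_ms=30000)))
```

    To actually launch it, something like subprocess.Popen(corenlp_server_command(), cwd="stanford-corenlp-4.0.0") should work, since the jars must be in the working directory for -cp "*" to find them.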

    Install the python package

    The standard package

    pip install pycorenlp
    

    does not work with Python 3.9, so you need to do

    pip install git+https://github.com/sam-s/py-corenlp.git
    

    (See also the official list).

    Use it

    from pycorenlp import StanfordCoreNLP
    
    nlp = StanfordCoreNLP('http://localhost:9000')
    res = nlp.annotate("I love you. I hate him. You are nice. He is dumb",
                       properties={
                           'annotators': 'sentiment',
                           'outputFormat': 'json',
                           'timeout': 1000,
                       })
    for s in res["sentences"]:
        print("%d: '%s': %s %s" % (
            s["index"],
            " ".join([t["word"] for t in s["tokens"]]),
            s["sentimentValue"], s["sentiment"]))
    

    and you will get:

    0: 'I love you .': 3 Positive
    1: 'I hate him .': 1 Negative
    2: 'You are nice .': 3 Positive
    3: 'He is dumb': 1 Negative
    

    Notes

    1. You pass the whole text to the server and it splits it into sentences. It also splits sentences into tokens.
    2. The sentiment is ascribed to each sentence, not the whole text. The mean sentimentValue across sentences can be used to estimate the sentiment of the whole text.
    3. The average sentiment of a sentence is between Neutral (2) and Negative (1). The full range runs from VeryNegative (0) to VeryPositive (4), but the two extremes appear to be quite rare.
    4. You can stop the server either by typing Ctrl-C in the terminal you started it from or with the shell command kill $(lsof -ti tcp:9000). 9000 is the default port; you can change it with the -port option when starting the server.
    5. Increase timeout (in milliseconds) in server or client if you get timeout errors.
    6. sentiment is just one annotator; there are many more, and you can request several by separating them with commas: 'annotators': 'sentiment,lemma'.
    7. Beware that the sentiment model is somewhat idiosyncratic (e.g., the result is different depending on whether you mention David or Bill).
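
    Note 2 suggests averaging sentimentValue across sentences to score the whole text. A minimal sketch of that idea follows. mean_sentiment and nearest_label are hypothetical helpers, the label spellings are the ones used in this answer rather than necessarily the server's exact strings, and the sample dict stands in for a real server response so the snippet runs without a server.

```python
# Labels indexed by sentimentValue (0..4), spelled as in the notes above.
LABELS = ["VeryNegative", "Negative", "Neutral", "Positive", "VeryPositive"]


def mean_sentiment(response):
    """Average sentimentValue over all sentences of an annotate() result."""
    # int() handles the value whether the server sends "3" or 3.
    values = [int(s["sentimentValue"]) for s in response["sentences"]]
    if not values:
        raise ValueError("response contains no sentences")
    return sum(values) / len(values)


def nearest_label(score):
    """Map a mean score to the closest label (ties round up)."""
    return LABELS[min(4, max(0, int(score + 0.5)))]


# Stand-in for the server response, using the values from the example output.
sample = {"sentences": [{"sentimentValue": "3"}, {"sentimentValue": "1"},
                        {"sentimentValue": "3"}, {"sentimentValue": "1"}]}
score = mean_sentiment(sample)
print(score, nearest_label(score))  # prints: 2.0 Neutral
```

    With a live server you would pass the res dict from the example above instead of sample. A plain mean treats every sentence equally, so a long neutral passage can drown out one strongly worded sentence.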

    PS. I cannot believe that I added a 9th answer, but, I guess, I had to, since none of the existing answers helped me (some of the 8 previous answers have now been deleted, some others have been converted to comments).