
Python VADER lexicon structure for sentiment analysis


I am using the VADER sentiment lexicon from Python's nltk library to analyze text sentiment. The lexicon does not suit my domain well, so I want to add my own sentiment scores for various words. I got hold of the lexicon text file (vader_lexicon.txt) to do just that, but I do not understand the structure of this file. For example, the word obliterate has the following entry in the text file:

    obliterate -2.9 0.83066 [-3, -4, -3, -3, -3, -3, -2, -1, -4, -3]

Clearly the -2.9 is the average of sentiment scores in the list. But what does the 0.83066 represent?

Thanks!


Solution

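  • According to the VADER source code, only the first number on each line (the mean rating, -2.9 in your example) is used when the lexicon is loaded. The rest of the line, including the 0.83066 and the list of raw ratings, is ignored: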

    for line in self.lexicon_full_filepath.split('\n'):
        (word, measure) = line.strip().split('\t')[0:2]  # Here: only the word and its mean rating are kept
        lex_dict[word] = float(measure)
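
As for what the 0.83066 is: as far as the VADER README describes the file, each line is tab-delimited and holds the token, the mean rating, the standard deviation, and the raw human ratings. The sketch below checks that reading against the entry from the question and then overrides a score the way the loader would store it. It is a minimal illustration, not the library's own code: `parse_entry` and `custom_scores` are hypothetical names, the entry is assumed to use tabs between fields, and the custom words and scores are made up.

```python
import ast
import statistics

# One raw entry, copied from the question; fields are assumed to be tab-separated.
entry = "obliterate\t-2.9\t0.83066\t[-3, -4, -3, -3, -3, -3, -2, -1, -4, -3]"


def parse_entry(line):
    """Split a lexicon line into (token, mean, std_dev, raw_ratings)."""
    token, mean, std_dev, ratings = line.strip().split("\t")[:4]
    return token, float(mean), float(std_dev), ast.literal_eval(ratings)


token, mean, std_dev, ratings = parse_entry(entry)

# Recompute both statistics from the raw ratings to compare with the file.
computed_mean = round(statistics.mean(ratings), 1)
computed_std = round(statistics.pstdev(ratings), 5)  # population standard deviation

# Mirror what the loader keeps: token -> mean rating only.
lexicon = {token: mean}

# Hypothetical domain-specific overrides; these words and scores are invented.
custom_scores = {"obliterate": -1.0, "bullish": 2.5}
lexicon.update(custom_scores)

print(computed_mean, computed_std)
print(lexicon)
```

The recomputed values agree with the second and third fields of the entry, which is consistent with the third field being the population standard deviation of the ten ratings. Since the loader only ever stores the word and its mean rating, editing the file is not required for your use case: in nltk, the `lexicon` attribute of a `SentimentIntensityAnalyzer` instance is a plain dict, so the same `update` call should work on it, assuming the vader_lexicon resource has been downloaded.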