nlp nltk sentiment-analysis wordnet senti-wordnet

Classifying negative and positive words in large files?


I am trying to count the positive and negative words in a very large file. I only need a primitive approach (one that does not take ages). I have tried sentiwordnet but keep getting an IndexError: list index out of range, which I think is due to the words not being listed in the wordnet dictionary. The text contains a lot of typos and 'non-words'.

If someone could give any suggestion, I would be very grateful!


Solution

  • It all depends on what your data is like and what the final objective of your task is. You would need to give us a more detailed description of your project but, in general, here are your options:

    - Make your own sentiment analysis dictionary: I really doubt this is what you want to do, since it takes a lot of time and effort, but if your data is simple enough it's doable.
    - Clean your data: if your tokens aren't in senti-wordnet because there's too much noise and too many badly spelled words, then try to correct them before passing them through wordnet. It will at least limit the number of errors you get.
    - Use a senti-wordnet alternative: granted, there aren't that many good ones, but you can always try sentiment_classifier or nltk's sentiment if you're using Python (which, by the looks of your error, you are).
    - Classify only what you can: this is what I would recommend. If the word is not in senti-wordnet, then move on to the next one. Just catch the error (try: ... except IndexError: pass) and try to infer the general sentiment of the data by counting the sentiment words you actually catch.
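The "classify only what you can" option could look roughly like the sketch below. To keep it self-contained, it uses a tiny in-memory dictionary (`TOY_LEXICON`, with made-up scores) as a stand-in for senti-wordnet; `lookup_score` and `count_sentiment` are hypothetical helper names, not part of any library. In a real run you would swap the body of `lookup_score` for your actual lexicon lookup.

```python
import re
from collections import Counter

# Toy stand-in for a real sentiment lexicon; the scores are invented
# for illustration and do not come from senti-wordnet.
TOY_LEXICON = {
    "good": 0.75,
    "great": 0.8,
    "happy": 0.6,
    "bad": -0.7,
    "awful": -0.9,
    "sad": -0.6,
}

# Very primitive tokenizer: runs of lowercase letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def lookup_score(word, lexicon=TOY_LEXICON):
    """Return a polarity score for `word`; raises KeyError if unknown."""
    return lexicon[word]


def count_sentiment(lines, lexicon=TOY_LEXICON):
    """Count positive, negative, neutral and unknown tokens.

    `lines` can be any iterable of strings, including an open file
    object, so a large file is processed one line at a time.
    """
    counts = Counter()
    for line in lines:
        for token in TOKEN_RE.findall(line.lower()):
            try:
                score = lookup_score(token, lexicon)
            except (KeyError, IndexError):
                # Typos and non-words end up here: skip them and move on.
                counts["unknown"] += 1
                continue
            if score > 0:
                counts["positive"] += 1
            elif score < 0:
                counts["negative"] += 1
            else:
                counts["neutral"] += 1
    return counts


sample = ["This is a good, GREAT day", "awful wether, so sad :("]
result = count_sentiment(sample)
print(result["positive"], result["negative"], result["unknown"])
```

For a large file, passing the file object itself (for example `with open(path, encoding="utf-8", errors="replace") as f: count_sentiment(f)`) avoids loading everything into memory. Catching both `KeyError` and `IndexError` is a deliberate choice here, since which one you see depends on how the lookup is written.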

    PS: We would need to see your code to be sure, but I think there may be another reason why you're getting an IndexError. If the word were not in senti-wordnet, I would expect a KeyError, though it also depends on how you coded your function.
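To illustrate why the exception type depends on how the lookup is coded, here is a small plain-Python sketch. It assumes two hypothetical lookup shapes: one that returns a list of candidate entries (empty for an unknown word) and one that is a dictionary keyed by word. Neither is claimed to be how your code or senti-wordnet actually behaves.

```python
def first_or_none(matches):
    """Return the first match, or None when the list is empty."""
    return matches[0] if matches else None


no_matches = []               # list-style lookup result for an unknown word
toy_scores = {"good": 0.75}   # dict-style lookup with an invented score

try:
    no_matches[0]             # indexing an empty list
except IndexError as exc:
    list_error = type(exc).__name__

try:
    toy_scores["gooood"]      # missing dictionary key
except KeyError as exc:
    dict_error = type(exc).__name__

print(list_error, dict_error, first_or_none(no_matches))
```

If your lookup is of the list-returning kind, a guard like `first_or_none` lets you skip unknown words without a try/except at all.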

    Good luck and I hope it was helpful.