Tags: python, nlp, nltk, sentiment-analysis, corpus

How to create a corpus for sentiment analysis in NLTK?


I'm looking to use my own custom corpus within Visual Studio Code on macOS; I have read probably a hundred forum threads and I can't wrap my head around what I'm doing wrong, as I'm pretty new to programming.

This question seems to be the closest thing I can find to what I need to do; however, I don't understand how to do the following:

"on a Mac it would be in ~/nltk_data/corpora, for instance. And it looks like you also have to append your new corpus to the __init__.py within .../site-packages/nltk/corpus/."

When answering, please be aware that I am using Homebrew-installed Python and don't want to permanently override the data path, since I may also need a stock NLTK corpus in the same code.

If needed, I can post my attempt at using "PlaintextCorpusReader" along with the resulting traceback below, although I would rather not use PlaintextCorpusReader at all; ideally I could simply copy and paste my .txt files into an appropriate location and append that location as described in the quote above.

Thank you.

Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 42, in <module>
    short_pos = open("short_reviews/pos.txt", "r").read
IOError: [Errno 2] No such file or directory: 'short_reviews/pos.txt'


EDIT:


Thank you for your responses.

I have taken your advice and moved the folder out of NLTK's corpora.

I've been doing some experimenting with my folder location and I've gotten different tracebacks.

If you are saying the best way to do it is with PlaintextCorpusReader then so be it; however, maybe for my application I'd want to use CategorizedPlaintextCorpusReader?

sys.argv is definitely not what I meant, so I can read up on that later.

First, here is my code without any attempt to use PlaintextCorpusReader, which results in the above traceback when the folder "short_reviews" (containing the pos.txt and neg.txt files) is outside of the NLP folder:

import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

from nltk import word_tokenize

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

# def main():
#     file = open("short_reviews/pos.txt", "r")
#     short_pos = file.readlines()
#     file.close

short_pos = open("short_reviews/pos.txt", "r").read
short_neg = open("short_reviews/neg.txt", "r").read

documents = []

for r in short_pos.split('\n'):
    documents.append( (r, "pos") )

for r in short_neg.split('\n'):
    documents.append((r, "neg"))

all_words = []

short_pos_words = word.tokenize(short_pos)
short_neg_words = word.tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w. lower())

for w in short_neg_words:
    all_words.append(w. lower())

all_words = nltk.FreqDist(all_words)

However, when I move the folder "short_reviews" containing the text files into the NLP folder and run the same code as above (still without PlaintextCorpusReader), the following occurs:

Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 47, in <module>
    for r in short_pos.split('\n'):
AttributeError: 'builtin_function_or_method' object has no attribute 'split'

When I move the folder "short_reviews" containing the text files into the NLP folder and run the code below, which does use PlaintextCorpusReader, the following traceback occurs:

import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

from nltk import word_tokenize

from nltk.corpus import PlaintextCorpusReader
corpus_root = 'short_reviews'
word_lists = PlaintextCorpusReader(corpus_root, '*')
wordlists.fileids()


class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

# def main():
#     file = open("short_reviews/pos.txt", "r")
#     short_pos = file.readlines()
#     file.close

short_pos = open("short_reviews/pos.txt", "r").read
short_neg = open("short_reviews/neg.txt", "r").read

documents = []

for r in short_pos.split('\n'):
    documents.append((r, "pos"))

for r in short_neg.split('\n'):
    documents.append((r, "neg"))

all_words = []

short_pos_words = word.tokenize(short_pos)
short_neg_words = word.tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w. lower())

for w in short_neg_words:
    all_words.append(w. lower())

all_words = nltk.FreqDist(all_words)


Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/bettertrainingdata2", line 18, in <module>
    word_lists = PlaintextCorpusReader(corpus_root, '*')
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/plaintext.py", line 62, in __init__
    CorpusReader.__init__(self, root, fileids, encoding)
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/api.py", line 87, in __init__
    fileids = find_corpus_fileids(root, fileids)
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/util.py", line 763, in find_corpus_fileids
    if re.match(regexp, prefix+fileid)]
  File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 141, in match
    return _compile(pattern, flags).match(string)
  File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
error: nothing to repeat

Solution

  • The answer you refer to contains some very poor (or rather, inapplicable) advice. There is no reason to place your own corpus in nltk_data, or to hack nltk.corpus.__init__.py to load it like a native corpus. In fact, do not do these things.

    You should use PlaintextCorpusReader. I don't understand your reluctance to do so, but if your files are plain text, it's the right tool to use. Supposing you have a folder NLP/bettertrainingdata, you can build a reader that will load all .txt files in this folder like this (note that the second argument is a regular expression, not a shell glob; passing the bare '*' in your code is what raised error: nothing to repeat):

    myreader = nltk.corpus.reader.PlaintextCorpusReader(r"NLP/bettertrainingdata", r".*\.txt")
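
    Once the reader exists, the usual corpus methods are available on it. A minimal check, just a sketch assuming the folder really does contain .txt files:

    print(myreader.fileids())       # every .txt file the reader picked up
    print(len(myreader.words()))    # token count across the whole folder
    print(myreader.raw()[:200])     # first 200 characters of the raw text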
    

    If you add new files to the folder, the reader will find and use them. If what you want is to be able to use your script with other folders, then just do so: you don't need a different reader, you need to learn about sys.argv. If you are after a categorized corpus with pos.txt and neg.txt, then you need a CategorizedPlaintextCorpusReader (sketched below). If it's something else you want, then please edit your question to explain what you are trying to do.
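
    For completeness, here is a rough sketch of the categorized variant, assuming a folder short_reviews that contains exactly pos.txt and neg.txt (names taken from your question; adjust them to your layout). It also shows the sys.argv idea, letting the corpus folder be passed on the command line:

    import sys
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader

    # Take the corpus folder from the command line if given, otherwise fall
    # back to the short_reviews folder used in the question.
    corpus_root = sys.argv[1] if len(sys.argv) > 1 else "short_reviews"

    reader = CategorizedPlaintextCorpusReader(
        corpus_root,
        r"(pos|neg)\.txt",              # which files to load (a regex, not a glob)
        cat_pattern=r"(pos|neg)\.txt")  # category = first regex group of the file name

    print(reader.categories())          # ['neg', 'pos']
    print(reader.fileids("pos"))        # ['pos.txt']

    # One (review, label) pair per line, mirroring what your code builds by hand:
    documents = [(line, cat)
                 for cat in reader.categories()
                 for fid in reader.fileids(cat)
                 for line in reader.raw(fid).split("\n")]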