I have a custom corpus built from my own data, on which I need to do some classification. The dataset is in the same format as the movie_reviews corpus. Following the NLTK documentation, I use the code below to access the movie_reviews corpus. Is there any way to add a custom corpus to the nltk_data/corpora directory and access it the same way we access the existing corpora?
import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
While you could hack the nltk to make your corpus look like a built-in nltk corpus, that's the wrong way to go about it. The nltk provides a rich collection of "corpus readers" that you can use to read your corpora from wherever you keep them, without moving them into the nltk_data directory or hacking the nltk source. The nltk's own corpora use the same corpus readers behind the scenes, so your reader will have all the methods and behavior of the equivalent built-in corpora.
Let's see how the movie_reviews corpus is defined in nltk/corpus/__init__.py:
movie_reviews = LazyCorpusLoader(
    'movie_reviews', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*',
    encoding='ascii')
You can ignore the LazyCorpusLoader part; it just defers loading a corpus until it is actually used, since most programs never touch most of the bundled corpora. The rest shows that the movie review corpus is read with a CategorizedPlaintextCorpusReader, that its files all end in .txt, and that the reviews are sorted into categories by being placed in the subdirectories pos and neg. Finally, the corpus encoding is ascii. So define your own corpus like this (changing values as needed):
mycorpus = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"/home/user/path/to/my_corpus",
    r'(?!\.).*\.txt',
    cat_pattern=r'(neg|pos)/.*',
    encoding="ascii")
That's it; you can now call mycorpus.words(), mycorpus.sents(categories="neg"), etc., just as if it were a corpus provided by the nltk.