I have created my own corpus, similar to the movie_reviews corpus in nltk (categorized by neg|pos.)
Within the neg and pos folders are txt files.
Code:
from nltk.corpus import CategorizedPlaintextCorpusReader
mr = CategorizedPlaintextCorpusReader('C:\mycorpus', r'(?!\.).*\.txt',
cat_pattern=r'(neg|pos)/.*')
When I try to read or interact with one of these files, I am unable to.
e.g. len(mr.categories())
runs, but does not return anything:
>>>
I have read multiple documents and questions on here regarding custom categorized corpus', but I am still unable to use them.
Full code:
import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader
mr = CategorizedPlaintextCorpusReader('C:\mycorpus', r'(?!\.).*\.txt',
cat_pattern=r'(neg|pos)/.*')
len(mr.categories())
I eventually want to be able to preform a naive bayes algorithm against my data but I am unable to read the content.
Paths:
C:\mycorpus\pos
C:\mycorpus\neg
Within the pos file is a 'cv.txt' and the neg contains a 'example.txt'
I am using Linux, and the following modification to your code (with toy corpus files) works correctly for me:
import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader
import os
mr = CategorizedPlaintextCorpusReader(
'/home/ely/programming/nltk-test/mycorpus',
r'(?!\.).*\.txt',
cat_pattern=os.path.join(r'(neg|pos)', '.*')
)
print(len(mr.categories()))
This suggests it is a problem with the cat_pattern
string using /
as a file system delimiter when you're on a Windows system.
Using os.path.join
as in my example, or pathlib
if using Python 3, would be a good way to solve it so it is OS-agnostic and you don't trip up with the regular expression escape slashes mixed with file system delimiters.
In fact you may way to use this approach for all of the cases of file system delimiters in your argument strings, and it's generally a good habit to get in for making code portable and avoiding strange string munging tech debt.