
"Cloning" a corpus in NLTK?


I'm attempting to create my own corpus in NLTK. I've been reading some of the documentation on this and it seems rather complicated... All I want to do is "clone" the movie reviews corpus, but with my own text. I know I could just replace the files in the movie reviews corpus with my own, but that limits me to working with one such corpus at a time (i.e., I'd have to keep swapping files). Is there any way I could just clone the movie reviews corpus?

thanks Alex


Solution

  • The movie reviews are read with the CategorizedPlaintextCorpusReader class. Use it directly to load your own corpus. The following should work for an exact copy of the movie reviews corpus:

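    from nltk.corpus.reader import CategorizedPlaintextCorpusReader

    # path_to_your_reviews: root directory containing the neg/ and pos/ subdirectories
    mr = CategorizedPlaintextCorpusReader(path_to_your_reviews, r'(?!\.).*\.txt',
            cat_pattern=r'(neg|pos)/.*')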
    

    Whatever matches the parenthesized group in cat_pattern becomes the category: in this case, neg and pos. If your corpus has different categories (e.g., movie genres rather than positive/negative evaluations), change the directory structure and adjust the cat_pattern parameter to match.

    PS. For categorized corpora with a different structure, NLTK offers other ways to specify the categories besides cat_pattern (a cat_map dictionary or a cat_file mapping file); read the documentation of CategorizedPlaintextCorpusReader.
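
To see which files the two patterns above would pick up and how each file gets its category, here is a minimal standard-library sketch. It only approximates the reader's behaviour as I understand it (the file pattern must match the whole relative path, and the first group of cat_pattern is taken as the category); it is not NLTK's actual code. The file names and the categorize helper are hypothetical, made up for illustration.

```python
import re

# The two patterns from the answer above.
fileids_pattern = r'(?!\.).*\.txt'
cat_pattern = r'(neg|pos)/.*'

# Hypothetical paths, relative to the corpus root directory.
candidate_paths = [
    'neg/review_001.txt',
    'pos/review_002.txt',
    'pos/notes.md',      # wrong extension, so not part of the corpus
    '.hidden.txt',       # starts with a dot, rejected by (?!\.)
]

def categorize(paths, fileids_pattern, cat_pattern):
    """Map each selected path to the text captured by group 1 of cat_pattern."""
    categories = {}
    for path in paths:
        # Assumption: the file pattern has to match the entire relative path.
        if not re.fullmatch(fileids_pattern, path):
            continue
        match = re.match(cat_pattern, path)
        # This sketch silently skips files without a category; the real
        # reader may instead raise an error for such files.
        if match:
            categories[path] = match.group(1)
    return categories

file_categories = categorize(candidate_paths, fileids_pattern, cat_pattern)
print(file_categories)

# A corpus organised by genre directories only needs a different cat_pattern.
genre_categories = categorize(
    ['comedy/a.txt', 'drama/b.txt'], fileids_pattern, r'(\w+)/.*')
print(genre_categories)
```

Trying your own directory layout against the patterns this way is a cheap check before handing them to CategorizedPlaintextCorpusReader.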