python nltk corpus

Get the category of a given sentence from a categorized corpus using NLTK


Using NLTK, I've created a categorized corpus of approximately 100k sentences split into 36 categories.

I can access the sentences of a particular category like this:

romantic_comedies_sents = my_corpus.sents(categories='romantic_comedies')

However, given a sentence in the form of a tokenized list such as ["You", "had", "me", "at", "hello"], I'd like to efficiently identify the categories in which it occurs. Is there a fast way of doing this?

I've tried creating and using a dictionary with sentences as keys and categories as values, but creating this dictionary takes a long time on my computer, especially in comparison with NLTK's built-in methods. Is there a better way of doing it, preferably using NLTK?

Ultimately I'm trying to end up with this structure for every sentence:

(["You", "had", "me", "at", "hello"], {"romantic_comedies"})

Thanks in advance for any help.


Solution

  • NLTK's corpus reader's sents() method returns the sentences as a list of lists (each sentence is a list of tokens). This is not a particularly efficient structure for looping through to create a dictionary mapping sentences to categories, and lists are unhashable, so they can't be used as dictionary keys directly.

    The answer was to convert each sentence to a tuple and the list of sentences to a set (I only needed distinct sentences).

    Once converted, the loops used to create the dictionary mapping sentences to categories finished in 18 seconds rather than taking all night.
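
For anyone landing here later, the approach above might look roughly like the sketch below. It assumes a reader object that exposes categories() and sents(categories=...), as NLTK's categorized corpus readers do. The names build_sentence_index and TinyCorpus are made up for illustration and are not part of NLTK; TinyCorpus is only a stand-in so the snippet runs without a real corpus.

```python
from collections import defaultdict


def build_sentence_index(corpus):
    """Map each distinct sentence (as a tuple of tokens) to its set of categories."""
    index = defaultdict(set)
    for category in corpus.categories():
        # Tuples are hashable, lists are not. The set comprehension drops
        # repeated sentences so each one is handled once per category.
        distinct = {tuple(sent) for sent in corpus.sents(categories=category)}
        for sent in distinct:
            index[sent].add(category)
    return index


class TinyCorpus:
    """Hypothetical stand-in exposing only the two reader methods used above."""

    _data = {
        "romantic_comedies": [["You", "had", "me", "at", "hello"], ["I", "know"]],
        "dramas": [["I", "know"], ["I", "know"]],
    }

    def categories(self):
        return list(self._data)

    def sents(self, categories=None):
        return self._data[categories]


index = build_sentence_index(TinyCorpus())

# Look up a tokenized sentence; unknown sentences get an empty set.
query = ["You", "had", "me", "at", "hello"]
result = (query, index.get(tuple(query), set()))
print(result)
```

With a real corpus you would pass my_corpus instead of TinyCorpus(). Building the index is a one-off cost; after that each lookup is a single dictionary access on tuple(sentence).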