Using NLTK, I've created a categorized corpus of approximately 100k sentences split into 36 categories.
I can access the sentences of a particular category like this:
romantic_comedies_sents = my_corpus.sents(categories='romantic_comedies')
However, given a sentence in the form of a tokenized list,
such as ["You", "had", "me", "at", "hello"],
I'd like to efficiently identify the categories in which it occurs. Is there a fast way of doing this?
I've tried building a dictionary with sentences as keys and categories as values, but constructing it takes a very long time on my machine (especially compared with NLTK's built-in methods). Is there a better way of doing this, preferably using NLTK?
Ultimately I'm trying to end up with this structure for every sentence:
(("You", "had", "me", "at", "hello"), {"romantic_comedies"})
Thanks in advance for any help.
NLTK's corpus reader's sents() method returns a list of lists of tokens. Because lists are unhashable, they can't be used as dictionary keys, and looping over this structure to build a mapping from sentences to categories is slow.
The fix was to convert each sentence to a tuple (which is hashable, so it can serve as a dictionary key) and to wrap each category's sentences in a set, since I only needed distinct sentences.
With those conversions, the loops that build the dictionary mapping sentences to categories finished in 18 seconds instead of taking all night.
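For anyone who wants to see the approach spelled out, here is a minimal sketch. The function name `build_sentence_index` and the corpus variable are placeholders; any NLTK categorized corpus reader exposing `categories()` and `sents(categories=...)` should work the same way:

```python
from collections import defaultdict

def build_sentence_index(corpus):
    """Map each distinct sentence (as a tuple of tokens) to the set of
    categories in which it occurs."""
    index = defaultdict(set)
    for category in corpus.categories():
        # tuple() makes each sentence hashable so it can be a dict key;
        # the surrounding set() de-duplicates sentences within a category.
        for sent in set(tuple(s) for s in corpus.sents(categories=category)):
            index[sent].add(category)
    return index
```

Looking up the categories for a tokenized sentence is then a single dictionary access, e.g. `index[tuple(["You", "had", "me", "at", "hello"])]`.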