I am computing frequency distributions for some words in different genres of the Brown corpus.
My code:
import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
Output of the above code:
                   can could   may might  must  will
           news    93    86    66    38    50   389
       religion    82    59    78    12    54    71
        hobbies   268    58   131    22    83   264
science_fiction    16    49     4    12     8    16
        romance    74   193    11    51    45    43
          humor    16    30     8     8     9    13
But when I replace 'samples' with 'sample' in the last line of the code above, it prints the frequency distribution for every word in the corpus.
What is the difference between 'sample' and 'samples'?
Thank you.
cfd.tabulate() simply ignores any keyword argument that's not referenced in its implementation. That's why sample=modals still produces a full table for the FreqDist: 'sample' is not a keyword the method looks for, so it is silently dropped and no filtering by sample takes place. If you leave it out altogether, the effect should be the same.
This behavior is not NLTK-specific; it holds for any Python function or method that accepts arbitrary keyword arguments (**kwargs) and only looks up the names it knows about. I would recommend reading the Python Tutorial section on arbitrary argument lists and keyword arguments. I find it very clear.
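To see the effect in isolation, here is a minimal sketch that does not need NLTK at all. The function `tabulate_sketch` and the small `counts` dictionary are hypothetical stand-ins written for illustration; this is not NLTK's actual implementation, only a function that, like the method described above, accepts `**kwargs` and looks up a single key.

```python
def tabulate_sketch(counts, **kwargs):
    """Hypothetical stand-in for a method that accepts arbitrary keyword arguments."""
    # Only the 'samples' key is ever looked up. Any other keyword
    # (for example a misspelled 'sample') is accepted and silently ignored.
    samples = kwargs.get("samples", sorted(counts))
    return [(word, counts[word]) for word in samples]


# Toy data for illustration only.
counts = {"can": 93, "could": 86, "may": 66}

with_samples = tabulate_sketch(counts, samples=["can", "may"])  # filtered
with_typo = tabulate_sketch(counts, sample=["can", "may"])      # typo: no error, no filtering
without_kw = tabulate_sketch(counts)                            # no keyword at all

print(with_samples)             # only the two requested words
print(with_typo == without_kw)  # the misspelled keyword changes nothing
```

If you want misspelled keywords to fail loudly in your own functions, declare the parameters explicitly (for example `def f(counts, samples=None)`) instead of collecting them in `**kwargs`; Python then raises a TypeError for an unexpected keyword argument.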