import nltk
from nltk.corpus import brown
tagged = brown.tagged_words(tagset='universal')
I understand that the most frequent word following 'the' can be found like so:
cfd3 = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))
cfd3['the'].max()
However, how would one go about finding the most frequent noun following the word 'the'?
Make a FreqDist that counts only the nouns that follow the word "the".
The Brown corpus has a very rich tagset, so let's simplify things by asking for the simplified "universal" tagset. All nouns are now tagged "NOUN".
>>> noundist = nltk.FreqDist(w2 for ((w1, t1), (w2, t2)) in
...             nltk.bigrams(brown.tagged_words(tagset="universal"))
...             if w1.lower() == "the" and t2 == "NOUN")
>>> noundist.most_common(10)
[('world', 346), ('time', 250), ('way', 236), ('end', 206), ('fact', 194), ('state', 190),
('man', 176), ('door', 172), ('house', 152), ('city', 127)]
The comprehension unpacks the two word, tag tuples that form the bigram: (w1, t1), (w2, t2); checks that the first word (lowercased) is "the" and the second is tagged "NOUN"; and if so, passes the second word (so, w2 only) to be counted by the FreqDist.
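If it helps to see the filtering step in isolation, here is a minimal sketch of the same counting logic using only the standard library, so it can be tried without downloading the corpus. The function name `nouns_after`, its parameters, and the hand-tagged `sample` list are made up for illustration; they are not part of NLTK or the Brown corpus.

```python
from collections import Counter
from itertools import tee


def nouns_after(tagged_words, target="the", noun_tag="NOUN"):
    """Count words tagged `noun_tag` that immediately follow `target`.

    `tagged_words` is assumed to be an iterable of (word, tag) pairs.
    """
    first, second = tee(tagged_words)
    next(second, None)  # shift the second iterator by one to form bigrams
    return Counter(
        w2
        for (w1, _t1), (w2, t2) in zip(first, second)
        if w1.lower() == target and t2 == noun_tag
    )


# Hypothetical hand-tagged sample, not taken from the Brown corpus.
sample = [
    ("The", "DET"), ("door", "NOUN"), ("of", "ADP"), ("the", "DET"),
    ("old", "ADJ"), ("house", "NOUN"), ("faced", "VERB"), ("the", "DET"),
    ("door", "NOUN"), ("of", "ADP"), ("the", "DET"), ("city", "NOUN"),
]

counts = nouns_after(sample)
print(counts.most_common(2))  # [('door', 2), ('city', 1)]
```

Note that "house" is not counted: like the FreqDist above, this only looks at the word directly after "the", so "the old house" is skipped because "old" is tagged "ADJ". With the corpus installed, passing `brown.tagged_words(tagset="universal")` to the function should produce the same counts as `noundist`.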