python-3.x nlp nltk corpus pos-tagger

How to find most frequent noun following the word 'the'?


import nltk
from nltk.corpus import brown

tagged = brown.tagged_words(tagset='universal')

I understand that the most frequent word following 'the' can be found like so:

cfd3 = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

cfd3['the'].max()

However, how would one go about finding the most frequent noun following the word 'the'?


Solution

  • Make a FreqDist that counts only the nouns that follow the word "the".

    The Brown corpus has a very rich tagset, so let's simplify things by asking for the "universal" tagset, in which all nouns are tagged "NOUN".

    >>> noundist = nltk.FreqDist(w2 for ((w1, t1), (w2, t2)) in
    ...             nltk.bigrams(brown.tagged_words(tagset="universal"))
    ...             if w1.lower() == "the" and t2 == "NOUN")
    >>> noundist.most_common(10)
    [('world', 346), ('time', 250), ('way', 236), ('end', 206), ('fact', 194), ('state', 190), 
    ('man', 176), ('door', 172), ('house', 152), ('city', 127)]
    

    The generator expression unpacks the two (word, tag) tuples that form each bigram: (w1, t1), (w2, t2). It checks that the first word (lowercased) is "the" and that the second is tagged "NOUN"; if so, it passes only the second word, w2, to be counted by the FreqDist.
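
The filtering logic can also be checked without downloading the corpus. The sketch below is only an illustration under stated assumptions: the `tagged` list is a made-up, hand-tagged sample standing in for `brown.tagged_words(tagset="universal")`, `collections.Counter` is swapped in for `nltk.FreqDist`, `zip` over adjacent items plays the role of `nltk.bigrams`, and the helper name `nouns_after_the` is hypothetical.

```python
from collections import Counter

# Made-up, hand-tagged sample (an assumption for illustration only);
# it stands in for brown.tagged_words(tagset="universal").
tagged = [
    ("The", "DET"), ("world", "NOUN"), ("is", "VERB"),
    ("the", "DET"), ("old", "ADJ"), ("house", "NOUN"), (".", "."),
    ("Open", "VERB"), ("the", "DET"), ("door", "NOUN"),
    ("to", "ADP"), ("the", "DET"), ("world", "NOUN"), (".", "."),
]


def nouns_after_the(tagged_words):
    """Count nouns that immediately follow "the" (case-insensitive).

    Expects a list of (word, tag) tuples, since it slices the input.
    """
    # Pair each item with its successor, much like nltk.bigrams does.
    pairs = zip(tagged_words, tagged_words[1:])
    return Counter(
        w2
        for (w1, _t1), (w2, t2) in pairs
        if w1.lower() == "the" and t2 == "NOUN"
    )


noundist = nouns_after_the(tagged)
print(noundist.most_common(2))
```

Note that "house" is not counted in this sample, because the word directly before it is "old", not "the". Only a noun in the position immediately after "the" passes the filter, which is also how the FreqDist version above behaves.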