python, nlp, nltk

ConditionalFreqDist to find most frequent POS tags for words


I am trying to find the most frequent POS tag for each word in the dataset, but I am struggling with the ConditionalFreqDist part.

import nltk
tw = nltk.corpus.brown.tagged_words()

train_idx = int(0.8*len(tw))
training_set = tw[:train_idx]
test_set = tw[train_idx:]

words= list(zip(*training_set))[0]

from nltk import ConditionalFreqDist
ofd= ConditionalFreqDist(word for word in list(zip(*training_set))[0])

tags= list(zip(*training_set))[1]
ofd.tabulate(conditions= words, samples= tags)

ValueError: too many values to unpack (expected 2)


Solution

  • As the NLTK documentation puts it, a ConditionalFreqDist is

    A collection of frequency distributions for a single experiment run under different conditions.

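    The only thing you must provide is an iterable of (condition, sample) pairs, which in this problem are the words (conditions) and their corresponding POS tags (samples). Your code passes plain words instead, so ConditionalFreqDist tries to unpack each string into two values, which raises the ValueError. The code with minimal changes would look like this. It calculates the distributions for the whole training set but tabulates the results for only the first 10 words and tags, since tabulating everything would produce an unmanageably large table and may crash: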

    import nltk
    from nltk import ConditionalFreqDist
    
    tw = nltk.corpus.brown.tagged_words()
    train_idx = int(0.8*len(tw))
    training_set = tw[:train_idx]
    test_set = tw[train_idx:]
    words, tags = zip(*training_set)  # words are the conditions, tags are the samples
    
    # ConditionalFreqDist expects (condition, sample) pairs, here (word, tag)
    ofd = ConditionalFreqDist((word, tag) for word, tag in zip(words, tags))
    ofd.tabulate(conditions=words[:10], samples=tags[:10])
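
To get from the table to the most frequent tag per word, which is what the question asks for, something like the following sketch might help. It uses a small hand-made list of (word, tag) pairs in place of the Brown training set so that it runs without downloading the corpus. The toy data and the names `tagged` and `most_frequent_tag` are assumptions for illustration, not part of the original code.

```python
from nltk import ConditionalFreqDist

# Toy (word, tag) pairs standing in for the Brown training set (made up for illustration).
tagged = [
    ("the", "AT"), ("run", "NN"), ("run", "VB"), ("run", "VB"),
    ("the", "AT"), ("fast", "RB"), ("fast", "JJ"), ("fast", "RB"),
]

# Each pair is read as (condition, sample): the word is the condition, the tag is the sample.
cfd = ConditionalFreqDist(tagged)

# cfd[word] is a FreqDist over tags; FreqDist.max() returns the tag with the highest count.
most_frequent_tag = {word: cfd[word].max() for word in cfd.conditions()}

print(most_frequent_tag)
```

With the real data, passing `training_set` directly to `ConditionalFreqDist` should work the same way, since its elements are already (word, tag) pairs.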