python matplotlib nltk

Displaying the y-axis in percentage format when plotting a conditional frequency distribution


When plotting a conditional frequency distribution for a set of words in a text corpus, the y-axis is displayed as counts, not percentages.

I followed the code outlined in "Natural Language Processing with Python" by Steven Bird, Ewan Klein & Edward Loper to display the frequency distribution of word lengths for different languages of the UDHR in a Jupyter notebook.

import nltk
from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist((lang, len(word)) for lang in languages
                                                 for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)

I expected the y-axis to display cumulative percentages (as in the book), but it shows cumulative counts instead. How can I turn the y-axis into cumulative percentages?


Solution

  • Here is a solution that plots cumulative relative frequencies (from 0 to 1) instead of counts:

    import nltk
    import pandas as pd
    from nltk.corpus import udhr

    # Fetch the UDHR corpus if it is not already installed.
    nltk.download('udhr')

    languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

    # One frequency distribution of word lengths per language.
    cfd = nltk.ConditionalFreqDist(
        (lang, len(word))
        for lang in languages
        for word in udhr.words(lang + '-Latin1'))

    def plot_freq(lang):
        # Longest word length observed for this language.
        max_length = max(cfd[lang])

        # Relative frequency (0 to 1) of every word length up to the maximum.
        freq_by_length = {}
        for i in range(max_length + 1):
            freq_by_length[i] = cfd[lang].freq(i)

        series = pd.Series(freq_by_length, name=lang)

        # The running sum turns relative frequencies into a cumulative distribution.
        series.cumsum().plot(legend=True, title='Cumulative Distribution of Word Lengths')
    

    Then we can use this new function to plot all the languages provided in the example:

    for lang in languages:
        plot_freq(lang)
    

    This thread discusses examples taken from Chapter 2 of the NLTK book.
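
If you want percentage labels on the y-axis rather than fractions between 0 and 1, here is a minimal sketch. It is not from the book or from the solution above: the helper names `cumulative_percentages` and `plot_cumulative_percentages` are hypothetical, the toy counts are made up, and it assumes each distribution behaves like a mapping from word length to count (as `nltk.FreqDist`, a `Counter` subclass, does).

```python
from collections import Counter
from itertools import accumulate


def cumulative_percentages(counts):
    """Return [(sample, cumulative percent)] in ascending sample order."""
    total = sum(counts.values())
    if total == 0:
        return []
    samples = sorted(counts)
    running = accumulate(counts[s] for s in samples)
    return [(s, 100.0 * c / total) for s, c in zip(samples, running)]


def plot_cumulative_percentages(distributions):
    """Plot one cumulative-percentage line per condition.

    `distributions` maps a label to a count mapping (assumption: an
    nltk.ConditionalFreqDist fits, since it is dict-like).
    """
    # Imported lazily so the helper above works without matplotlib.
    import matplotlib.pyplot as plt
    from matplotlib.ticker import PercentFormatter

    fig, ax = plt.subplots()
    for label in distributions:
        points = cumulative_percentages(distributions[label])
        ax.plot([s for s, _ in points], [p for _, p in points], label=label)
    # Values are already on a 0-100 scale, which is PercentFormatter's default.
    ax.yaxis.set_major_formatter(PercentFormatter())
    ax.set_xlabel("Word length")
    ax.set_ylabel("Cumulative percentage")
    ax.legend()
    return ax


# Made-up toy counts (word length -> occurrences), not real UDHR data.
toy = Counter({1: 2, 2: 3, 3: 5})
result = cumulative_percentages(toy)
```

With the `cfd` built earlier, calling `plot_cumulative_percentages(cfd)` should draw one line per language with the y-axis labelled in percent.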