python python-2.7 matplotlib histogram ranking

How to generate a word frequency histogram, where bars are ordered according to their height


I have a long list of words, and I want to generate a histogram of the frequency of each word in the list. I was able to do that with the code below:

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

word_list = ['A','A','B','B','A','C','C','C','C']

counts = Counter(word_list)

labels, values = zip(*counts.items())

indexes = np.arange(len(labels))

plt.bar(indexes, values)
plt.show()

It doesn't, however, display the bins by rank (i.e. by frequency, with the highest frequency as the first bin on the left, and so on), even though printing counts shows them in that order: Counter({'C': 4, 'A': 3, 'B': 2}). How can I achieve that?


Solution

  • You can achieve the desired output by sorting your data first and then passing the ordered arrays to bar; below I use numpy.argsort for that. The plot then shows the bars in descending order of height (C, A, B); I also added the words as labels under the bars.

    Here is the code that produces the plot, with a few inline comments:

    from collections import Counter
    import numpy as np
    import matplotlib.pyplot as plt
    
    word_list = ['A', 'A', 'B', 'B', 'A', 'C', 'C', 'C', 'C']
    
    counts = Counter(word_list)
    
    labels, values = zip(*counts.items())
    
    # indices that sort the values in descending order
    indSort = np.argsort(values)[::-1]
    
    # rearrange your data
    labels = np.array(labels)[indSort]
    values = np.array(values)[indSort]
    
    indexes = np.arange(len(labels))
    
    # center each bar on its index so the tick labels line up with the bars
    plt.bar(indexes, values, align='center')
    
    # add labels
    plt.xticks(indexes, labels)
    plt.show()
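
As an alternative to numpy.argsort, Counter.most_common() called without an argument already returns the (word, count) pairs ordered from most to least frequent, which is also why printing counts shows them ranked. A minimal sketch of that variant follows; the helper name plot_ranked is made up for illustration, and matplotlib is imported inside it so that the ranking step runs on its own.

```python
from collections import Counter

word_list = ['A', 'A', 'B', 'B', 'A', 'C', 'C', 'C', 'C']

# most_common() returns (word, count) pairs, most frequent first
ranked = Counter(word_list).most_common()
labels, values = zip(*ranked)


def plot_ranked(labels, values):
    """Hypothetical helper: draw the bars left to right in the given order."""
    # imported here so the ranking above does not need matplotlib
    import matplotlib.pyplot as plt

    positions = list(range(len(labels)))
    plt.bar(positions, values, align='center')
    plt.xticks(positions, labels)
    plt.show()
```

Calling plot_ranked(labels, values) should then draw the bars for C, A and B from left to right.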
    

    If you want to plot only the n most frequent entries, you can replace the line

    counts = Counter(word_list)
    

    with

    counts = dict(Counter(word_list).most_common(n))
    

    In the case above, counts would then be

    {'A': 3, 'C': 4}
    

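    for n = 2. The order in which the keys are printed depends on the Python version; the sorting code above does not rely on it.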
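
Converting to a dict is optional here: most_common(n) returns a list of (word, count) pairs that is already in rank order, so it can be unpacked straight into labels and values. A small sketch, where top_n is a hypothetical helper name:

```python
from collections import Counter


def top_n(words, n):
    """Hypothetical helper: labels and counts of the n most frequent words."""
    # list of (word, count) pairs, highest count first
    ranked = Counter(words).most_common(n)
    if not ranked:
        # nothing to unpack for an empty list or n == 0
        return (), ()
    labels, values = zip(*ranked)
    return labels, values


word_list = ['A', 'A', 'B', 'B', 'A', 'C', 'C', 'C', 'C']
labels, values = top_n(word_list, 2)
```

With the sample list and n = 2 this gives the labels ('C', 'A') and the values (4, 3), which can be passed to plt.bar and plt.xticks in place of the sorted arrays.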

    If you would like to remove the frame of the plot and label the bars directly, you can check this post.
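
One possible way to do that, not necessarily the approach taken in the linked post, is to hide the axes spines and write each count above its bar with ax.text. The sketch below assumes the labels and values are already ranked; plot_frameless_bars is a hypothetical helper name.

```python
def plot_frameless_bars(labels, values):
    """Hypothetical helper: ranked bars without a frame, counts written on top."""
    # imported here so the module can be loaded without matplotlib
    import matplotlib.pyplot as plt

    positions = list(range(len(labels)))
    fig, ax = plt.subplots()
    bars = ax.bar(positions, values, align='center')

    # remove the frame (all four spines) and the y axis ticks
    for spine in ax.spines.values():
        spine.set_visible(False)
    ax.set_yticks([])

    # keep the words under the bars, without tick marks
    ax.set_xticks(positions)
    ax.set_xticklabels(labels)
    ax.tick_params(axis='x', length=0)

    # write each count just above the top edge of its bar
    for bar, value in zip(bars, values):
        x_center = bar.get_x() + bar.get_width() / 2.0
        ax.text(x_center, bar.get_height(), str(value), ha='center', va='bottom')

    return fig, ax
```

For the sample data, fig, ax = plot_frameless_bars(['C', 'A', 'B'], [4, 3, 2]) followed by matplotlib.pyplot.show() should display the chart.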