python pandas visualization jupyter noise

A smart way to get rid of insignificant data in pandas, or in its visualization engine, for a pie chart?


There can be a lot of insignificant edge cases and data noise. I want a pie chart (based on Bokeh or any other free, open-source plotting library) that shows data like this:

type size
 S    1
 V    2
 T    200
 ...
 Z    3333

reduced to its core, with the insignificant noise (types that make up less than 1% of the total size) put into a new "other" type.

1) Can pandas do this on its own? How? 2) Does some visualization library already come with such a feature built in?


Solution

  • Consider the pandas Series a, which holds the count of each value

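    import pandas as pd
    import numpy as np
    from string import ascii_uppercase
    
    # Build sample data: 26 types drawn with unequal probabilities
    np.random.seed([3,1415])
    types = np.random.permutation(list(ascii_uppercase))
    r = np.arange(1, 27)
    r = r / r.sum()
    s = np.random.choice(types, 10000, p=r)
    
    # pd.value_counts is deprecated in recent pandas; use the Series method
    a = pd.Series(s).value_counts()
    
    a.plot.pie(colormap='jet');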
    



    Now combine all groups that each make up less than 3% of the total into a single group, other

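    n = a / a.sum()   # each type's share of the total
    
    f = n < .03       # True for types below the 3% cutoff (use .01 for the question's 1%)
    
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    pd.concat([a[~f], pd.Series(a[f].sum(), ['other'])]).plot.pie(colormap='jet')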
    

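The same idea can be wrapped in a small reusable function. This is only a sketch: the name lump_small and its threshold and label parameters are made up for illustration and are not part of pandas. The sample frame contains just the four rows visible in the question (the elided rows are left out), and the 1% cutoff comes from the question rather than the 3% used above.

```python
import pandas as pd


def lump_small(counts, threshold=0.01, label="other"):
    """Fold entries whose share of the total is below `threshold` into `label`.

    Hypothetical helper, not a pandas API. `counts` is a Series of sizes
    indexed by type.
    """
    share = counts / counts.sum()          # fraction of the total per type
    small = share < threshold              # boolean mask of the "noise" types
    if not small.any():
        return counts.copy()               # nothing to fold
    other = pd.Series({label: counts[small].sum()})
    return pd.concat([counts[~small], other])


# Only the rows shown in the question; the rows hidden behind "..." are omitted.
df = pd.DataFrame({"type": ["S", "V", "T", "Z"], "size": [1, 2, 200, 3333]})
sizes = df.set_index("type")["size"]

reduced = lump_small(sizes, threshold=0.01)
print(reduced)
```

Calling reduced.plot.pie() afterwards should draw the reduced chart, assuming matplotlib is installed. Because the folding happens in pandas before plotting, the result can be handed to any plotting library, Bokeh included.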