Search code examples
pythonpandasseriescdf

Plotting CDF of a pandas series in python


Is there a way to do this? I cannot seem an easy way to interface pandas series with plotting a CDF (cumulative distribution function).


Solution

  • In case you are also interested in the values, not just the plot.

    import pandas as pd
    
    # If you are in jupyter
    %matplotlib inline
    

    This will always work (discrete and continuous distributions)

    # Define your series
    s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
    df = pd.DataFrame(s)
    
    # Get the frequency, PDF and CDF for each value in the series
    
    # Frequency
    stats_df = df \
    .groupby('value') \
    ['value'] \
    .agg('count') \
    .pipe(pd.DataFrame) \
    .rename(columns = {'value': 'frequency'})
    
    # PDF
    stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
    
    # CDF
    stats_df['cdf'] = stats_df['pdf'].cumsum()
    stats_df = stats_df.reset_index()
    stats_df
    

    enter image description here

    # Plot the discrete Probability Mass Function and CDF.
    # Technically, the 'pdf label in the legend and the table the should be 'pmf'
    # (Probability Mass Function) since the distribution is discrete.
    
    # If you don't have too many values / usually discrete case
    stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
    

    enter image description here

    Alternative example with a sample drawn from a continuous distribution or you have a lot of individual values:

    # Define your series
    s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
    
    # ... all the same calculation stuff to get the frequency, PDF, CDF
    
    # Plot
    stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
    

    enter image description here

    For continuous distributions only

    Please note if it is very reasonable to make the assumption that there is only one occurence of each value in the sample (typically encountered in the case of continuous distributions) then the groupby() + agg('count') is not necessary (since the count is always 1).

    In this case, a percent rank can be used to get to the cdf directly.

    Use your best judgment when taking this kind of shortcut! :)

    # Define your series
    s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
    df = pd.DataFrame(s)
    
    # Get to the CDF directly
    df['cdf'] = df.rank(method = 'average', pct = True)
    
    # Sort and plot
    df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
    

    enter image description here