Python Plotly CDF with Frequency Distribution Data


How do you make a CDF plot with Plotly from frequency distribution data stored in a Pandas DataFrame? Suppose the following toy data:

value   freq    
1       3
2       2
3       1

All of the examples show how to do it with raw data that looks like:

value
1
1
1
2
2
3

I am able to do it with Pandas .plot like so (but I would prefer to do the same with Plotly):

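import pandas as pd

# toy frequency table from above
df = pd.DataFrame({'value': [1, 2, 3], 'freq': [3, 2, 1]})

# calculate PDF (relative frequency); copy so df is not modified
stats_df = df.copy()
stats_df['pdf'] = stats_df['freq'] / stats_df['freq'].sum()

# calculate CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()

# plot
stats_df.plot(x = 'value', 
              y = ['pdf', 'cdf'], 
              logx = True,
              kind = 'line',
              grid = True)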

If you would like to demonstrate with a toy dataset, here's one: https://raw.githubusercontent.com/plotly/datasets/master/2010_alcohol_consumption_by_country.csv

References:

https://plotly.com/python/v3/discrete-frequency/

https://plotly.com/python/distplot/


Solution

  • Plotly's distplot (ff.create_distplot) does not build a CDF for you.

    It plots only a histogram and a PDF curve. The code below does this for the alcohol sample:

    import plotly.figure_factory as ff
    import pandas as pd
    
    data = pd.read_csv(
        'https://raw.githubusercontent.com/plotly/datasets/master/2010_alcohol_consumption_by_country.csv')
    
    x = data['alcohol'].values.tolist()
    
    group_labels = ['']
    fig = ff.create_distplot([x], group_labels,
                             bin_size=.25, show_rug=False)
    fig.show()
    
    

    If you need the CDF itself, compute it with a third-party library before plotting. The example below uses NumPy to build a histogram and its CDF:

    import plotly.graph_objs as go
    import numpy as np
    import pandas as pd
    
    data = pd.read_csv(
        'https://raw.githubusercontent.com/plotly/datasets/master/2010_alcohol_consumption_by_country.csv')
    
    x = data['alcohol'].values.tolist()
    
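    hist, bin_edges = np.histogram(x, bins=100, density=True)
    # cumulative probability up to the right edge of each bin
    cdf = np.cumsum(hist * np.diff(bin_edges))
    # hist has one value per bin, while bin_edges has one extra element
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    fig = go.Figure(data=[
        go.Bar(x=bin_centers, y=hist, name='Histogram'),
        go.Scatter(x=bin_edges[1:], y=cdf, name='CDF')
    ])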
    fig.show()
    

    Note that the CDF is a broken (piecewise-linear) line. This is because it is computed empirically from the binned data rather than from a model of the unknown distribution. To get a smooth curve, you need to know the distribution law.