Search code examples
pythonpandasmatplotlibseaborncdf

Plot CDF of columns from a CSV file using pandas


I want to plot CDF value of columns from a CSV file using pandas as follows:

I have tried some codes, but they are not reporting the correct plot. Can you help with an easy way?

df = pd.read_csv('pathfile.csv')
def compute_distrib(df, col):
    stats_df = df.groupby(col)[col].agg('count')\
                 .pipe(pd.DataFrame).rename(columns={col: 'frequency'})
    
    # PDF
    stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
    
    # CDF
    stats_df['CDF'] = stats_df['pdf'].cumsum()
    
    # modifications
    stats_df = stats_df.reset_index()\
                       .rename(columns={col:"X"})
    stats_df[" "] = col
    return stats_df

cdf = []
for col in ['1','2','3','4']: 
    cdf.append(compute_distrib(df, col))
cdf = pd.concat(cdf, ignore_index=True)

import seaborn as sns

sns.lineplot(x=cdf["X"],
             y=cdf["CDF"],
             hue=cdf[" "]);

Solution

  • Due to the lack of runnable code on your post, I created my own code for plotting the CDF of the columns of a dataframe df:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from itertools import accumulate
    
    
    # GENERATE EXAMPLE DATA
    df = pd.DataFrame()
    df['x1'] = np.random.uniform(-1,1, size=1000)
    df['x2'] = df['x1'] + np.random.uniform(-1,1, size=1000)
    df['x3'] = df['x2'] + np.random.uniform(-1,1, size=1000)
    df['x4'] = df['x3'] + np.random.uniform(-1, 1, size=1000)
    
    # START A PLOT
    fig,ax = plt.subplots()
    
    for col in df.columns:
    
      # SKIP IF IT HAS ANY INFINITE VALUES
      if not all(np.isfinite(df[col].values)):
        continue
    
      # USE numpy's HISTOGRAM FUNCTION TO COMPUTE BINS
      xh, xb = np.histogram(df[col], bins=60, normed=True)
    
      # COMPUTE THE CUMULATIVE SUM WITH accumulate
      xh = list(accumulate(xh))
      # NORMALIZE THE RESULT
      xh = np.array(xh) / max(xh)
    
      # PLOT WITH LABEL
      ax.plot(xb[1:], xh, label=f"$CDF$({col})")
    ax.legend()
    plt.title("CDFs of Columns")
    plt.show()
    

    The resulting plot from this code is below:

    CDF plot

    To put in your own data, just replace the # GENERATE EXAMPLE DATA section with df = pd.read_csv('path/to/sheet.csv')

    Let me know if anything in the example is unclear to you or if it needs more explanation.