python pandas matplotlib gaussian kernel-density

Scalable Python normal distribution from pandas DataFrame


I have a pandas DataFrame (code below) that holds the mean and standard deviation by day of week and quarter. What I'd like to do is take each mean and standard deviation by day of week, draw a random normal sample from those two values, then plot it.

import numpy as np
import pandas as pd

np.random.seed(42)
day_of_week=['mon', 'tues', 'wed', 'thur', 'fri', 'sat','sun']
year=[2017]
qtr=[1,2,3,4]
mean=np.random.uniform(5,30,len(day_of_week)*len(qtr))
std=np.random.uniform(1,10,len(day_of_week)*len(qtr))

dat=pd.DataFrame({'year':year*(len(day_of_week)*len(qtr)),
             'qtr':qtr*len(day_of_week),
             'day_of_week':day_of_week*len(qtr),
             'mean':mean,
             'std': std})
dowuq=dat.day_of_week.unique()

Right now I have a solution to the above that works, but it isn't very scalable. If I wanted to add more columns, e.g. another year, or break the data out by week, it would not be efficient. I'm fairly new to Python, so any help is appreciated.

Code that works but is not scalable:

import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
for w in dowuq:
    datsand=dat[dat['day_of_week']==''+str(w)+''][0:4]
    mu=datsand.iloc[0]['mean']
    sigma=datsand.iloc[0]['std']
    mu2=datsand.iloc[1]['mean']
    sigma2=datsand.iloc[1]['std']
    mu3=datsand.iloc[2]['mean']
    sigma3=datsand.iloc[2]['std']
    mu4=datsand.iloc[3]['mean']
    sigma4=datsand.iloc[3]['std']             
    s1=np.random.normal(mu, sigma, 2000)
    s2=np.random.normal(mu2, sigma2, 2000)
    s3=np.random.normal(mu3, sigma3, 2000)
    s4=np.random.normal(mu4, sigma4, 2000)
    sns.kdeplot(s1, bw='scott', label='Q1')
    sns.kdeplot(s2, bw='scott', label='Q2')
    sns.kdeplot(s3, bw='scott', label='Q3')
    sns.kdeplot(s4, bw='scott', label='Q4')
    plt.title(''+str(w)+' in 2017')
    plt.ylabel('Density')
    plt.xlabel('Random')
    plt.xticks(rotation=15)
    plt.show()

Solution

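  • You should probably be using groupby, which splits a DataFrame into one sub-frame per distinct value of the grouping column. For the time being we group on 'day_of_week' only, but you could pass a list of columns to extend this in future if required.

    Within each group, iterrows loops over that day's rows (one per quarter), so nothing about the number of quarters is hard-coded: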

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    np.random.seed(42)
    day_of_week = ['mon', 'tues', 'wed', 'thur', 'fri', 'sat', 'sun']
    year = [2017]
    qtr = [1, 2, 3, 4]
    mean = np.random.uniform(5, 30, len(day_of_week) * len(qtr))
    std = np.random.uniform(1, 10, len(day_of_week) * len(qtr))
    
    dat = pd.DataFrame({'year': year * (len(day_of_week) * len(qtr)),
                        'qtr': qtr * len(day_of_week),
                        'day_of_week': day_of_week * len(qtr),
                        'mean': mean,
                        'std': std})
    
    # Group by day of the week
    for day, values in dat.groupby('day_of_week'):
        # Loop over rows for each day of the week
        for i, r in values.iterrows():
            cur_dist = np.random.normal(r['mean'], r['std'], 2000)
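            # seaborn >= 0.11 takes bw_method; older releases used bw
            sns.kdeplot(cur_dist, bw_method='scott', label='{}_Q{}'.format(day, r['qtr']))
        # Call legend explicitly so the per-quarter labels are shown
        plt.legend()
        plt.title('{} in 2017'.format(day))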
        plt.ylabel('Density')
        plt.xlabel('Random')
        plt.xticks(rotation=15)
        plt.show()
        plt.clf()
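
The question also asks how this would scale to extra columns such as another year. A minimal sketch of one way to do that is below: it passes a list of columns to groupby and keeps the sampling separate from the plotting. The helper name `sample_by_group`, the second year (2018), and the use of `np.random.default_rng` are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

days = ['mon', 'tues', 'wed', 'thur', 'fri', 'sat', 'sun']

# Hypothetical data: one row per (year, qtr, day_of_week) combination.
index = pd.MultiIndex.from_product(
    [[2017, 2018], [1, 2, 3, 4], days],
    names=['year', 'qtr', 'day_of_week'])
dat = index.to_frame(index=False)

data_rng = np.random.default_rng(42)
dat['mean'] = data_rng.uniform(5, 30, len(dat))
dat['std'] = data_rng.uniform(1, 10, len(dat))


def sample_by_group(df, group_cols, n=2000, seed=0):
    """Yield (key, {qtr: samples}) for every combination of group_cols.

    Assumes df has 'qtr', 'mean' and 'std' columns, as in the question.
    """
    rng = np.random.default_rng(seed)
    for key, rows in df.groupby(list(group_cols)):
        # Older pandas returns a scalar key for a single grouping column.
        if not isinstance(key, tuple):
            key = (key,)
        samples = {
            int(q): rng.normal(mu, sigma, n)
            for q, mu, sigma in zip(rows['qtr'], rows['mean'], rows['std'])
        }
        yield key, samples


# Adding or removing a grouping column only changes this list.
groups = dict(sample_by_group(dat, ['year', 'day_of_week']))
```

Each entry of `groups` can then be plotted with the same loop body as in the answer: one `sns.kdeplot` call per quarter, with the title built from the group key (for example `(2017, 'mon')`).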