Tags: python, pandas, dataframe, t-test

T-test for groups within a Pandas dataframe for a unique id


I have the following dataframe, and for every ID I am performing a t-test between all weekdays and all weekend days of the month.

    +-----+------------+-----------+---------+-----------+
    | id  | usage_day  | dow       | tow     | daily_avg |
    +-----+------------+-----------+---------+-----------+
    | c96 | 01/09/2020 | Tuesday   | week    | 393.07    |
    | c96 | 02/09/2020 | Wednesday | week    | 10.38     |
    | c96 | 03/09/2020 | Thursday  | week    | 429.35    |
    | c96 | 04/09/2020 | Friday    | week    | 156.20    |
    | c96 | 05/09/2020 | Saturday  | weekend | 346.22    |
    | c96 | 06/09/2020 | Sunday    | weekend | 106.53    |
    | c96 | 08/09/2020 | Tuesday   | week    | 194.74    |
    | c96 | 10/09/2020 | Thursday  | week    | 66.30     |
    | c96 | 17/09/2020 | Thursday  | week    | 163.84    |
    | c96 | 18/09/2020 | Friday    | week    | 261.81    |
    | c96 | 19/09/2020 | Saturday  | weekend | 410.30    |
    | c96 | 20/09/2020 | Sunday    | weekend | 266.28    |
    | c96 | 23/09/2020 | Wednesday | week    | 346.18    |
    | c96 | 24/09/2020 | Thursday  | week    | 20.67     |
    | c96 | 25/09/2020 | Friday    | week    | 222.23    |
    | c96 | 26/09/2020 | Saturday  | weekend | 449.84    |
    | c96 | 27/09/2020 | Sunday    | weekend | 438.47    |
    | c96 | 28/09/2020 | Monday    | week    | 10.44     |
    | c96 | 29/09/2020 | Tuesday   | week    | 293.59    |
    | c96 | 30/09/2020 | Wednesday | week    | 194.49    |
    +-----+------------+-----------+---------+-----------+

My script is as follows, but it is unfortunately too slow and not the pandas way of doing things. How could I do it more efficiently?

    from scipy.stats import ttest_ind

    p_val = []
    stat_flag = []
    all_ids = df.id.unique()
    alpha = 0.05
    print(len(all_ids))
    for uid in all_ids:
        sub = df[df.id == uid]                 # all rows for this id
        group1 = sub[sub.tow == 'week']        # weekday rows
        group2 = sub[sub.tow == 'weekend']     # weekend rows
        t_stat, p_value_ttest = ttest_ind(group1.daily_avg, group2.daily_avg, equal_var=False)
        p_val.append(p_value_ttest)
        if p_value_ttest < alpha:
            stat_flag.append(1)
        else:
            stat_flag.append(0)

p_val gives the p-value for every id (and stat_flag whether it is below alpha).


Solution

  • Dataset

    Based on the dataset you provided:

    import io
    from scipy import stats
    import pandas as pd
    
    s = """id|usage_day|dow|tow|daily_avg
    c96|01/09/2020|Tuesday|week|393.07
    c96|02/09/2020|Wednesday|week|10.38
    c96|03/09/2020|Thursday|week|429.35
    c96|04/09/2020|Friday|week|156.20
    c96|05/09/2020|Saturday|weekend|346.22
    c96|06/09/2020|Sunday|weekend|106.53
    c96|08/09/2020|Tuesday|week|194.74
    c96|10/09/2020|Thursday|week|66.30
    c96|17/09/2020|Thursday|week|163.84
    c96|18/09/2020|Friday|week|261.81
    c96|19/09/2020|Saturday|weekend|410.30
    c96|20/09/2020|Sunday|weekend|266.28
    c96|23/09/2020|Wednesday|week|346.18
    c96|24/09/2020|Thursday|week|20.67
    c96|25/09/2020|Friday|week|222.23
    c96|26/09/2020|Saturday|weekend|449.84
    c96|27/09/2020|Sunday|weekend|438.47
    c96|28/09/2020|Monday|week|10.44
    c96|29/09/2020|Tuesday|week|293.59
    c96|30/09/2020|Wednesday|week|194.49"""
    df = pd.read_csv(io.StringIO(s), sep='|')
    

    I add a new id with the same data, for the sake of groupby clarity:

    df2 = df.copy()
    df2['id'] = 'c97'
    df = pd.concat([df, df2])
    

  • MCVE

    You do not have to resort to an explicit loop; instead, take advantage of the apply method, which operates on DataFrames and also works with groupby.

    To do that, we define a function that performs the desired test on a DataFrame (groupby will call this function for each sub-dataframe corresponding to a combination of the grouping keys):

    def ttest(x):
        # Collect the daily_avg values of each 'tow' group (week vs. weekend):
        g = x.groupby('tow').agg({'daily_avg': list})
        # Welch's t-test (unequal variances) between the two groups:
        r = stats.ttest_ind(g.loc['week', 'daily_avg'], g.loc['weekend', 'daily_avg'], equal_var=False)
        # Return the result fields (statistic, pvalue) as a Series:
        s = {k: getattr(r, k) for k in r._fields}
        return pd.Series(s)
    

    Then it suffices to chain apply after the groupby call:

    T = df.groupby('id').apply(ttest)
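    On recent pandas versions, groupby(...).apply may emit a deprecation warning about operating on the grouping column; since ttest does not use the id column, you can exclude it explicitly. A minimal variant, assuming a pandas version that supports the include_groups keyword (2.2 or later):

    T = df.groupby('id').apply(ttest, include_groups=False)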
    

    The result is roughly:

         statistic    pvalue
    id                      
    c96  -2.128753  0.059126
    c97  -2.128753  0.059126
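
    If you also need the binary significance flag from your original script, it can be derived directly from the pvalue column. A minimal sketch, assuming the same alpha = 0.05 threshold:

    alpha = 0.05
    T['stat_flag'] = (T['pvalue'] < alpha).astype(int)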
    

  • Refactoring

    Once you have understood the power of this approach, you can refactor the above code into reusable functions such as:

    def ttest(x, y):
        # Welch's t-test (unequal variances) between two samples:
        return stats.ttest_ind(x, y, equal_var=False)

    def apply_test(x, subgroup='tow', value='daily_avg', key1='week', key2='weekend', test=ttest):
        # Collect the values of each subgroup, run the test between the two
        # requested keys, and return the result fields as a Series:
        g = x.groupby(subgroup).agg({value: list})
        r = test(g.loc[key1, value], g.loc[key2, value])
        return pd.Series({k: getattr(r, k) for k in r._fields})
    
    # Example call for a hypothetical dataframe with other column names:
    T = df.groupby('id').apply(apply_test, subgroup='anotherbucket', key1='experience', key2='reference', value='threshold')
    

    This allows you to adapt the statistical test and the DataFrame columns to your needs.
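
    For example, on the example dataset you could simply rely on the default arguments, or swap in another two-sample test such as SciPy's Mann-Whitney U. A sketch (the mwu wrapper is only an illustration):

    # Same computation as before, using the default column names and test:
    T = df.groupby('id').apply(apply_test)

    # Swapping in a different two-sample test:
    def mwu(x, y):
        return stats.mannwhitneyu(x, y)

    U = df.groupby('id').apply(apply_test, test=mwu)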