Search code examples
pythonpandasgroup-byscipypipe

Performing ANOVA and mean on pandas GroupBy on undefined number of columns


I need to create a versatile function that I can pipe to a pandas DataFrame that can run ANOVA on an unspecified number of groups and subgroups. I have a number of DataFrames that have two categorical columns (here, cat1 and cat2) and a number of continuous columns (here, just contin_1 and contin2).

       cat1 cat2      contin_Z       contin_Y    ...
0         A   1             08             33
1         B   1             00             86
2         C   2             85             65
3         A   1             82             08
4         A   2             90             85
5         A   3             93             93
6         A   2             13             65
7         A   1             33             90
8         B   2             00             10
9         C   2             58             57
10        C   1             68             68
11        B   1             43             40
12        A   1             35            NaN
13        A   3             75             40
14        A   3             68             53
15        A   2             93             93
16        B   3             18             65
17        C   3             33             28
18        A   1             50             94
19        B   1             25             90

Taking the sample data above, I would want to group it by cat1 and run ANOVA on all the cat2 groups for each cat1 group and also output the mean for each cat1/cat2 combination. I would want to be able to do this regardless of the number of continuous columns -- kind of like how pandas.DataFrame.groupby.mean can operate on any number of columns.

The output should look like this, in some kind of shape: (This is a sample output just to show the shape of the output I'm seeking; the mean and p-values are nonsense values.)

    cat1   cat2  Z_mean   Z_anova_pval    Y_mean   Y_anova_pval    ...
0      A      1     54           .005         43           .076
1      A      2     73           .005         34           .076
2      A      3     34           .005         42           .076
3      B      1     76           .567         32           .002
4      B      2     98           .567         78           .002
5      B      3     73           .567        101           .002
6      C      1     84           .043         15           .041
7      C      2     23           .043         43           .041
8      C      3     82           .043         87           .041

I'm imagining the function would be called something like this:

data.groupby('cat1').pipe(f)

The closest I've got to the function I need is this line passing all the cat2 subgroups to f_oneway with a generator expression, but I can't figure out how to do this within a groupby on cat1, and especially not while also grabbing the mean, which I'm supposing might need to be done as a separate with groupby.agg, but I'm unsure how to make that scalable to however many continuous columns happen to be in a DataFrame.

from scipy import stats
def run_anova(data):
   return stats.f_oneway(*(data[data.cat2==cat].dropna() for cat in data.cat2.unique()))

Edit: Something along these lines seems to produce an accurate result for the p-value side of things.

from scipy.stats import f_oneway as anova

continuous_cols = [ ... list of all desired continuous cols ... ]

cats = data.cat1.unique()

pvals = (
    data
    .groupby(['cat2'])[continuous_cols]
    .agg(
        lambda x: 
        anova(*[x[data.cat1==cat].dropna() for cat in cats])[0]
    )
)


>         contin_Z   contin_Y
>  cat2
>     1       .045       .087
>     2       .654       .945
>     3       .943       .003

However, I can't figure out how to integrate this with needing the mean of each subgroup (other than to do two GroupBys and then merge after resetting one level of the index on the means table):

pvals = pvals.add_suffix('_pval')

means = data.groupby(['cat2', 'cat1'])[continuous_cols].mean()

means = means.add_suffix('_mean')

table = (
    pvals
    .merge(means.reset_index(level=1), left_index=True, right_index=True)
)

Solution

  • You already have a good idea of the steps involved. The gist here is that your means and pvalues are calculated on two different aggregation levels:

    • mean is calculated at the (cat2, cat1) level
    • pvalue is calculated at the cat2 level

    The data.groupby(...).pipe(f) idea isn't really workable because it restricts you to a single level. The solution below will compute the means and the pvalues separately then join them together.

    It does not require you to list the continuous variables in advance, nor change the code to accomodate new variables. It only cares about the categorical variables, which are fixed in your use case.

    # Move the category columns into the index so only we know that only continuos
    # variables remain in the dataframe
    tmp = data.set_index(["cat2", "cat1"])
    
    # The means are easy to calculate. To distinguish them from the later pvalues,
    # add a level to the column names
    means = tmp.groupby(["cat2", "cat1"]).mean()
    means.columns = pd.MultiIndex.from_product([["mean"], means.columns])
    
    # Calculate the pvalues
    from scipy.stats import f_oneway
    
    cat1 = data["cat1"].unique()
    
    def anova(group: pd.DataFrame):
        # We know that every column in `group` is a continuous variable. The
        # categorical variables have been moved to the index
        result = {
            ("pvalue", col): f_oneway(
                *[group.loc[(slice(None), cat), col].dropna() for cat in cat1]
            ).pvalue
            for col in group.columns
        }
        return pd.Series(result)
    
    
    pvalues = tmp.groupby("cat2").apply(anova)
    
    # Merge the results
    result = pd.merge(means.reset_index("cat1"), pvalues, on="cat2").set_index("cat1", append=True)
    

    Result:

                    mean             pvalue          
                contin_Z contin_Y  contin_Z  contin_Y
    cat2 cat1                                        
    1    A     41.600000    56.25  0.341883  0.856163
         B     22.666667    72.00  0.341883  0.856163
         C     68.000000    68.00  0.341883  0.856163
    2    A     65.333333    81.00  0.392376  0.034400
         B      0.000000    10.00  0.392376  0.034400
         C     71.500000    61.00  0.392376  0.034400
    3    A     78.666667    62.00  0.086695  0.610693
         B     18.000000    65.00  0.086695  0.610693
         C     33.000000    28.00  0.086695  0.610693