Search code examples
pythonpandasdataframestandard-deviation

Calculate %CV for 2 pandas dataframes (or just standard deviation and mean)


I have two pandas dataframes that are the same length and width. I want to find the %CV between each cell of the two dataframes (or the standard dev and mean, and then I can calculate CV by hand).

For example, if I have something like this:

   A B C
0  1 2 3
1  4 5 6 
2  7 8 9 

and

   A B C
0  3 2 1
1  6 5 4
2  9 8 7

I want to return one dataframe containing the standard deviation/mean/CV between A1 in the first dataframe and A1 in the second dataframe.

Example (mean):

   A B C
0  2 2 2 
1  5 5 5 
2  8 8 8

I want the same for standard deviation and then to calculate %CV via standard deviation / mean.

I have tried converting the dataframes to numpy

stan_dev = np.dstack((arr1.to_numpy(), arr2.to_numpy())).std(axis=2)

but I get this error:

TypeError: loop of ufunc does not support argument 0 of type float which has no callable sqrt method

Thank you!


Solution

  • You can do this with either pandas or numpy. However, I am cautious about what you hope to infer based on the standard deviation/coefficient of variation based on 2 data points.

    With that warning out of the way, you'll note that the 2 answers below have a similar structure-

    1. Combine datasets into a common container
    2. Perform aggregation computations
    3. Operate on those aggregations to calculate coefficient of variation

    NumPy approach

    • stack to combine datasets
    • aggregate with axis=0 allows to aggregate across the lowest dimension
    • calculate coef. of variation
    • reassemble results into a DataFrame
    import numpy as np
    import pandas as pd
    
    arr = np.stack((df1, df2))
    cv_arr = arr.std(axis=0, ddof=1) / arr.mean(axis=0)
    out = pd.DataFrame(data=cv_arr, columns=df1.columns, index=df1.index)
    
    print(out)
              A    B         C
    0  0.707107  0.0  0.707107
    1  0.282843  0.0  0.282843
    2  0.176777  0.0  0.176777
    

    Pure pandas approach

    • pd.concat to combine datasets and specify the keys paramter to obtain use a MultiIndex to represent data higher than 2d
    • groupby to operate across the groups
    • .swaplevels to better leverage index alignment
    • use .pipe to calculate coef of variation
    import pandas as pd
    
    out = (
        pd.concat([df1, df2], keys=[0, 1])
        .groupby(level=1)
        .agg(['mean', 'std'])
        .swaplevel(axis=1)
        .pipe(lambda df_: df_['std'] / df_['mean'])
    )
    
    print(out)
              A    B         C
    0  0.707107  0.0  0.707107
    1  0.282843  0.0  0.282843
    2  0.176777  0.0  0.176777
    

    I'd personally recommend the numpy solution because these data & analysis are better represented by a 3d array than a hierarchical DataFrame.