Search code examples
pythonpandaspandas-groupby

Compute Average/Mean across Dataframes in Python Pandas


I have a list of dataframes. Each dataframe was originally numerical data taken from which are all shaped identically with 21 rows and 5 columns. The first column is an index (index 0 to index 20). I want to compute the average (mean) values into a single dataframe. Then I want to export the dataframe to excel.

Here's a simplified version of my existing code:

#look to concatenate the dataframes together all at once
#dataFrameList is the given list of dataFrames
concatenatedDataframes = pd.concat(dataFrameList, axis = 1)

#grouping the dataframes by the index, which is the same across all of the dataframes
groupedByIndex = concatenatedDataframes.groupby(level = 0)

#take the mean 
meanDataFrame = groupedByIndex.mean()

# Create a Pandas Excel writer using openpyxl as the engine.
writer = pd.ExcelWriter(filepath, engine='openpyxl')
meanDataFrame.to_excel(writer)

However, when I open the excel file, I see what looks like EVERY dataframe is copied into the sheet and the average/mean values are not shown. A simplified example is shown below (cutting most of the rows and dataframes)

              Dataframe 1                   Dataframe 2                   Dataframe 3
Index  Col2   Col3   Col4   Col5     Col2   Col3   Col4   Col5     Col2   Col3   Col4   Col5
0      Data   Data   Data   Data     Data   Data   Data   Data     Data   Data   Data   Data
1      Data   Data   Data   Data     Data   Data   Data   Data     Data   Data   Data   Data
2      Data   Data   Data   Data     Data   Data   Data   Data     Data   Data   Data   Data
....

I'm looking for something more like:

           Averaged DF
Index  Col2                                   Col3                                   Col4
0      Mean Index0,Col2 across DFs    Mean Index0,Col3 across DFs    Mean Index0,Col4 across DFs
1      Mean Index1,Col2 across DFs    Mean Index1,Col3 across DFs    Mean Index1,Col4 across DFs
2      Mean Index2,Col2 across DFs    Mean Index2,Col3 across DFs    Mean Index3,Col4 across DFs
...

I have also already seen this answer: Get the mean across multiple Pandas DataFrames

If possible, I'm looking for a clean solution, not one which would simply involve looping through each dataFrame value by value. Any suggestions?


Solution

  • Perhaps I misunderstood what you asked

    The solution is simple. You just need to concat along the correct axis

    dummy data

    df1 = pd.DataFrame(index=range(rows), columns=range(columns), data=[[10 + i * j for j in range(columns)] for i in range(rows) ])
    df2 = df1 = pd.DataFrame(index=range(rows), columns=range(columns), data=[[i + j for j in range(columns)] for i in range(rows) ])
    

    ps. this should be your job as OP

    pd.concat

    df_concat0 = pd.concat((df1, df2), axis=1)
    

    puts all the dataframes next to eachother.

        0   1   0   1
    0   10  10  0   1
    1   10  11  1   2
    2   10  12  2   3
    

    If we want to do a groupby now, we first need to stack, groupby and stack again

    df_concat0.stack().groupby(level=[0,1]).mean().unstack()

        0   1
    0   5.0     5.5
    1   5.5     6.5
    2   6.0     7.5
    

    If we do

    df_concat = pd.concat((df1, df2))
    

    This puts all the dataframes on top of eachother

        0   1
    0   10  10
    1   10  11
    2   10  12
    0   0   1
    1   1   2
    2   2   3
    

    now we need to just groupby the index, like you did

    df_concat.groupby(level=0).mean()

        0   1
    0   5.0     5.5
    1   5.5     6.5
    2   6.0     7.5
    

    and then use ExcelWriter as context manager

    with pd.ExcelWriter(filepath, engine='openpyxl') as writer:
        result.to_excel(writer)
    

    or just plain

    result.to_excel(filepath, engine='openpyxl') 
    

    if you can overwrite what is is filepath