Search code examples
pythonpandasdataframeloopsvectorization

Compute mean value of rows that has the same column value in Pandas


I'm trying to combine three pandas DataFrames together

One of them (called major) has a column category where each row has a unique label :

major_df = pd.DataFrame(np.random.randint(0, 100, size=(3, 2)), columns=list("AB"))
major_df["category"] = pd.Series(["cat_A", "cat_B", "cat_C"])
    A   B category
0  90  17    cat_A
1  36  81    cat_B
2  90  67    cat_C

Two other dfs (called minor) contains multiple rows and have their own unique column names. Each df has a column category where each row has a value that is present in the major df category column :

minor_dfs = {}
for k, cols in zip(("1st", "2nd"), ("CD", "EF")):
    minor_dfs[k] = pd.DataFrame(np.random.randint(0, 100, size=(8, 2)), columns=list(cols))
    minor_dfs[k]["category"] = np.random.choice(["cat_A", "cat_B", "cat_C"], 8)

Here is an example of one of those minor dfs. The only difference between both is that first minor df has the columns C and D, where the second has columns E and F.

    C   D category
0  71  44    cat_C
1   5  88    cat_C
2   8  78    cat_C
3  31  27    cat_C
4  42  48    cat_B
5  18  18    cat_B
6  84  23    cat_A
7  94  23    cat_A

So, my goal is to compute the mean of the values in minor dfs based on the category column, so that at the end, I have the following dfs :

           C      D
cat_A  89.00  23.00
cat_B  30.00  33.00
cat_C  28.75  59.25

where each column contains the mean of the values that are in each category.


For that, I made the following code, where we create empty DataFrames with the column values of the minor dfs and indices from the different values of categories. I then fill this DataFrame using a for loop where I iterate over every value of the index.

copy_dfs = {}
for k, min_df in minor_dfs.items():
    # Get columns from minor df
    # Get index from category of major df
    col_names = min_df.columns.values
    ind_values = major_df.category.values

    # Create a df with columns and indices and set values to np.nan
    copy_df = pd.DataFrame(np.nan, index=ind_values, columns=col_names)
    copy_df = copy_df.drop("category", axis=1)

    # For each category in the index of the dataframe
    for maj_category in copy_df.index:
        # Select rows in minor df where category is the same as major df category
        minor_rows = min_df[min_df.category == maj_category]
        minor_rows = minor_rows.drop("category", axis=1)
        # Compute the mean values (by column) of the rows that were selected
        # Add the mean values into copy_df, where the index corresponds to major df category
        copy_df.loc[maj_category] = minor_rows.mean()

    # Store into dict
    copy_dfs[k] = copy_df

Yet, I think that this code could be optimized using vectorized operations, especially in the part where I iterate for each row. So I was wondering if there was an easier and clever way to accomplish what I'm trying to do ?


Solution

  • This?

    import pandas as pd
    
    df = pd.read_excel('test.xlsx')
    df1 = df.groupby(['category']).mean()
    print(df)
    print(df1)
    

    output:

        C   D category
    0  71  44    cat_C
    1   5  88    cat_C
    2   8  78    cat_C
    3  31  27    cat_C
    4  42  48    cat_B
    5  18  18    cat_B
    6  84  23    cat_A
    7  94  23    cat_A
    
    
                  C      D
    category
    cat_A     89.00  23.00
    cat_B     30.00  33.00
    cat_C     28.75  59.25