Search code examples
pythonpandasaggregate

pandas data aggregation based on column filters


I have a data frame like this

col1  col2  col3  col4  col5
A     A1    X     1      2
A     A2    Y     2      2
A     A3    Z     1      2
B     B1    X     2      2
B     B2    Y     2      2
B     B3    Z     1      2

And i have a mapping like this.

mapping

col1  col3
 A --> X
 A --> Y
(A,B) --> Z
(A,B) --> (Y,Z)

This mapping works like a filter option that we use in excel. for eg: For A-->X, we'll filter A from col1 and filter X from col3, and get the sum from col4

for (A,B)-->Z, we'll filter both A & B from col1 and Z from col3, whatever sum we get from col4 should be stored.

I have tried groupby and pivot_table methods but couldn't be able to cover all the cases in mapping.

df1 = df.pivot_table(index= [col1,col3], columns = col3, values = col4, aggfunc='sum')

My expected result is.

col1  X   Y   Z   YZ
A     1   2   2   6
B             2   6

Is there any way to get this using groupy or pivot_table, please let me know.


Solution

  • With the dataframe you provided:

    import pandas as pd
    
    df = pd.DataFrame(
        {
            "col1": ["A", "A", "A", "B", "B", "B"],
            "col2": ["A1", "A2", "A3", "B1", "B2", "B3"],
            "col3": ["X", "Y", "Z", "X", "Y", "Z"],
            "col4": [1, 2, 1, 2, 2, 1],
            "col5": [2, 2, 2, 2, 2, 2],
        }
    )
    

    Here is one way to do it with Pandas groupby and advanced indexing:

    # Get values
    df = df.groupby(["col1", "col3"]).agg(list)["col4"].apply(lambda x: x[0])
    
    print(df)
    # Output
    
    col1  col3
    A     X       1
          Y       2
          Z       1
    B     X       2
          Y       2
          Z       1
    
    # Assign values to new dataframe
    new_df = pd.DataFrame()
    
    for i in ("X", "Y"):
        new_df.at["A", i] = df.loc[("A", i)]
    
    for i in ("A", "B"):
        new_df.at[i, "Z"] = df.loc["A":"B", "Z", :].reset_index().sum()["col4"]
        new_df.at[i, "YZ"] = df.loc["A":"B", "Y":"Z", :].reset_index().sum()["col4"]
    
    # Cleanup
    new_df = new_df.fillna(0).astype(int).replace(0, "")
    new_df.columns.name = "col1"
    

    Then:

    print(new_df)
    
    # Output
    col1  X  Y  Z  YZ
    A     1  2  2   6
    B           2   6