
Sum values from two columns in the group if not duplicate


I have pandas dataframes that look similar to this:

import pandas as pd

df = pd.DataFrame({"week": [1, 1, 1, 1, 1],
                   "area1_code1": ["A", "A", "A", "A", "C"],
                   "area1_code2": ["A1", "A1", "A2", "A2", "C1"],
                   "area1_member": [10, 10, 8, 8, 2],
                   "area2_code1": ["B", "B", "B", "B", "D"],
                   "area2_code2": ["B1", "B2", "B1", "B2", "D1"],
                   "area2_member": [3, 3, 3, 3, 6]})
   week area1_code1 area1_code2  area1_member area2_code1 area2_code2  area2_member
0     1           A          A1            10           B          B1             3
1     1           A          A1            10           B          B2             3
2     1           A          A2             8           B          B1             3
3     1           A          A2             8           B          B2             3
4     1           C          C1             2           D          D1             6

I want to group the data either by area1_code1 and area2_code1, or by week, and then find the sum of area1_member and area2_member over the unique area1_code2 and area2_code2 values within each group (each code2 value should be counted only once per group).

Desired output by area1_code1 and area2_code1:

   week area1_code1 area2_code1  members
0     1           A           B       24
1     1           C           D        8

Desired output by week:

   week  members
0     1       32

I have tried the following, which produced the correct result by week but not the desired result by area1_code1 and area2_code1.

area1 = df[["week", "area1_code1", "area1_code2", "area1_member"]].drop_duplicates(["week", "area1_code2"])
area1.rename(columns={"area1_code1": "area_code1",
                      "area1_code2": "area_code2",
                      "area1_member": "area_member"}, inplace=True)
area2 = df[["week", "area2_code1", "area2_code2", "area2_member"]].drop_duplicates(["week", "area2_code2"])
area2.rename(columns={"area2_code1": "area_code1",
                      "area2_code2": "area_code2",
                      "area2_member": "area_member"}, inplace=True)
result = pd.concat([area1, area2]).drop_duplicates().reset_index(drop=True)
   week area_code1 area_code2  area_member
0     1          A         A1           10
1     1          A         A2            8
2     1          C         C1            2
3     1          B         B1            3
4     1          B         B2            3
5     1          D         D1            6
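Grouping result by area_code1 shows the problem: the concat loses the pairing between area1_code1 and area2_code1, so the A and B rows can no longer be combined and I only get per-prefix totals instead of the A/B and C/D sums:

print(result.groupby(["week", "area_code1"])["area_member"].sum())
week  area_code1
1     A             18
      B              6
      C              2
      D              6
Name: area_member, dtype: int64

The by-week aggregation, on the other hand, works: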
result_week = result.groupby("week")["area_member"].sum().reset_index()
   week  area_member
0     1           32

I have to do this for hundreds of dataframes with more than a million rows each. What would be the most efficient solution to this problem?

Thanks!


Edited

The members of B1 and B2 could be different, e.g. B1 could be 3 while B2 is 4. @Laurent B.'s answer below gives the correct results in that case.
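For instance (df2 is just an illustrative name for the modified frame):

# Same frame as above, but B1 and B2 now carry different member counts
df2 = df.assign(area2_member=[3, 4, 3, 4, 6])

The expected totals then become 10 + 8 + 3 + 4 = 25 for the A/B group, and still 2 + 6 = 8 for C/D.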


Solution

  • Proposed code:

    import pandas as pd
    
    df = pd.DataFrame({"week": [1, 1, 1, 1, 1],
                       "area1_code1": ["A", "A", "A", "A", "C"],
                       "area1_code2": ["A1", "A1", "A2", "A2", "C1"],
                       "area1_member": [10, 10, 8, 8, 2],
                       "area2_code1": ["B", "B", "B", "B", "D"],
                       "area2_code2": ["B1", "B2", "B1", "B2", "D1"],
                       "area2_member": [3, 3, 3, 3, 6]})
    
    def func(g):
        # Within one (week, area1_code1, area2_code1) group, keep one row
        # per unique code2 value on each side, then sum the member columns
        g1 = g.drop_duplicates(["area1_code2"])
        g2 = g.drop_duplicates(["area2_code2"])
        return g1["area1_member"].sum() + g2["area2_member"].sum()
    
    g_cols = ["week", "area1_code1", "area2_code1"]
    # Apply 'func' to each group; reset_index(name=...) keeps the group
    # keys as columns, so no positional merge is needed afterwards
    r = df.groupby(g_cols).apply(func).reset_index(name="Members")
    
    print(r)
    

    Result:

       week area1_code1 area2_code1  Members
    0     1           A           B       24
    1     1           C           D        8
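
  • Follow-ups (a sketch building on the block above; df2, a1 and a2 are illustrative names, and the drop_duplicates variant is an assumption aimed at the performance concern, not part of the original answer):

    # Week-level total: aggregate the grouped result once more
    print(r.groupby("week")["Members"].sum().reset_index())
    #    week  Members
    # 0     1       32
    
    # Scenario from the edit: B1 and B2 carry different member counts
    df2 = df.assign(area2_member=[3, 4, 3, 4, 6])
    print(df2.groupby(g_cols).apply(func).reset_index(name="Members"))
    #    week area1_code1 area2_code1  Members
    # 0     1           A           B       25
    # 1     1           C           D        8
    
    # Possible vectorized alternative for very large frames: drop the
    # duplicates per group and side up front, sum each side per group,
    # then add the two sums. Assumes each code2 value carries a single
    # member count within a group, as in the examples above.
    a1 = df.drop_duplicates(g_cols + ["area1_code2"]).groupby(g_cols)["area1_member"].sum()
    a2 = df.drop_duplicates(g_cols + ["area2_code2"]).groupby(g_cols)["area2_member"].sum()
    print((a1 + a2).reset_index(name="Members"))
    #    week area1_code1 area2_code1  Members
    # 0     1           A           B       24
    # 1     1           C           D        8

    The drop_duplicates route avoids the per-group Python call that groupby.apply makes, which tends to matter at millions of rows, but it is worth profiling both on the real data.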