Search code examples
pythonpandasdataframejupyter-notebooksubset

Subset df in Python Pandas using two columns in a loop


I have got 1 huge df (over 5 million rows) and 5 columns in jupyter notebook. 'Name' column has 20 unique values and 'lot' column has 10 unique values. So 200 unique combinations of these two columns. I need to subset the df based on the unique combinations of these two columns, make some calculations and return some parts of the calculations in a final df. Final df will have 200 rows, one for each iteration/subset/combination.

For example (with only 2 names and 3 lots = 6 combinations):

    dfhuge
    index      Name   lot   col3 col4 col5
    123        delta   1    786  10   1
    657        delta   2    787  11   2
    567        delta   2    777  13   4
    456        bravo   3    775  12   3
    789        bravo   3    772  14   5

For 1 of the 6 iterations/combinations, I could use

df1outof6 = dfhuge.loc[(dfhuge["Name"] == "delta") & (dfhuge["lot"] == 2)]
df1outof6
   index       Name   lot   col3 col4 col5
    657        delta   2    787  11   2
    567        delta   2    777  13   4
mean = df1outof6["col4"].mean()
sum = df1outof6["col5"].sum()
... 

I want the above operation repeated for all the 6 subsets using a loop.

Final df should be:

    finaldf
    newcol                     col4mean  col5sum  
    combination1(delta and 2)      12     6   
    combination2(delta and 1)      10     1    
    combination3(delta and 3)      0      0   
    combination4(bravo and 1)      0      0    
    combination5(bravo and 2)      0      0
    combination6(bravo and 3)      13     8  

I need a loop, result of which will be the finaldf. I can't use df.loc to subset each combination because I originally have 200 of them.


Solution

  • Since you did not provide how to create your test df, I made my own and changed column names. Hope it's ok

    import numpy as np
    import pandas as pd
    from itertools import product
    df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))
    
    rows = []
    unique_a = list(df["A"].unique())
    unique_b = list(df["B"].unique())
    unique_a.sort()
    unique_b.sort()
    key_error=0
    for a_index, b_index in product(unique_a, unique_b):
        df_subset =  df.loc[(df["A"] == a_index) & (df["B"] == b_index),:]
        calc0 = 0
        calc1 = 0
        if len(df_subset):
            calc0 = df_subset["C"].sum()
            calc1 = df_subset["D"].mean()
        rows.append(((a_index, b_index), calc0, calc1))
            
        
    final_df = pd.DataFrame(
        rows,  
        columns = ["index combination", "calc1", "calc2"]
    ).set_index("index combination")
    
    final_df
    

    Initial df

        A   B   C   D
    0   1   7   3   4
    1   6   1   3   9
    2   7   9   1   4
    3   8   7   4   6
    4   0   3   3   3
    ... ... ... ... ...
    95  7   5   5   7
    96  0   4   9   0
    97  0   3   2   1
    98  4   2   9   2
    99  6   5   6   6
    

    Final df

        calc1   calc2
    index combination       
    (0, 0)  0   0.0
    (0, 1)  21  5.0
    (0, 2)  1   9.0
    (0, 3)  20  3.0
    (0, 4)  9   0.0
    ... ... ...
    (9, 5)  1   0.0
    (9, 6)  5   7.0
    (9, 7)  0   0.0
    (9, 8)  0   0.0
    (9, 9)  0   0.0