python pandas loops group-by iteration

In Python, I need an iterative groupby that accesses the previous "grouped value" to set the value of each row of the aggregated column


I have the following dataset that you can replicate with this code:

import pandas as pd

number_order = [2, 2, 3, 3, 5, 5, 5, 6]
number_fakecouriers = [1, 2, 1, 2, 1, 2, 3, 3]
dictio = {"number_order": number_order, "number_fakecouriers": number_fakecouriers}
actual_table = pd.DataFrame(dictio)


What I need is code that, using a for loop or a groupby, generates the following result:

number_order  number_fakecouriers
           2                    1
           3                    2
           5                    3
           6                 None

The code should group by the column "number_order" and take the minimum of the column "number_fakecouriers", but each time it should exclude the values of "number_fakecouriers" that have already been selected for an earlier group. If no values remain for a group, it should return "None".

This is the explanation row by row:

1) "number_order" = 2 : here the value of "number_fakecouriers" is "1". This is the first group, so nothing has been excluded yet and "1" is simply the minimum of "number_fakecouriers" where ["number_order" = 2]

2) "number_order" = 3 : here the value of "number_fakecouriers" is "2", because "1" has already been selected for ["number_order" = 2]. Excluding "1", the minimum value where ["number_order" = 3] is "2"

3) "number_order" = 5 : here the value of "number_fakecouriers" is "3", because "1" and "2" have already been selected

4) "number_order" = 6 : here the value of "number_fakecouriers" is "None", because the only value of "number_fakecouriers" where ["number_order" = 6] is "3", and "3" has already been selected
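
To make the rule concrete, the row-by-row logic above can be sketched as a plain Python loop. This is only an illustration of the intended behaviour, not a proposed solution: the names candidates, used and result are made up for this sketch, and it assumes the orders are processed in ascending order.

```python
number_order = [2, 2, 3, 3, 5, 5, 5, 6]
number_fakecouriers = [1, 2, 1, 2, 1, 2, 3, 3]

# Collect the candidate courier values for each order.
candidates = {}
for order, courier in zip(number_order, number_fakecouriers):
    candidates.setdefault(order, []).append(courier)

used = set()   # values already selected for an earlier order
result = {}    # order -> selected value (or None)
for order in sorted(candidates):
    # Only values that no earlier order has taken are still available.
    free = [c for c in candidates[order] if c not in used]
    pick = min(free) if free else None
    if pick is not None:
        used.add(pick)
    result[order] = pick

print(result)  # {2: 1, 3: 2, 5: 3, 6: None}
```

Because the sketch takes min() over the remaining values, it does not depend on the rows inside each group being sorted.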


Solution

  • Try:

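    def fn(x, seen):
        # Return the first value in this group that no earlier group has taken.
        for v in x:
            if v in seen:
                continue
            seen.add(v)
            return v
        # Every value is already taken: return None (shown as NaN in the output).
        return None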
    
    
    out = (
        actual_table.groupby("number_order")["number_fakecouriers"]
        .apply(fn, seen=set())
        .reset_index()
    )
    print(out)
    

    Prints:

       number_order  number_fakecouriers
    0             2                  1.0
    1             3                  2.0
    2             5                  3.0
    3             6                  NaN
    

    Note: fn picks the first unseen value in each group, which is the minimum only if the rows are sorted. Sort the dataframe before processing if it is not sorted already:

    actual_table = actual_table.sort_values(
        by=["number_order", "number_fakecouriers"]
    )
    
    ...
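
The printed result shows floats (1.0, 2.0, 3.0) because a missing value in a NumPy-backed integer column forces pandas to use a float dtype with NaN. If whole numbers with an explicit missing marker are preferred, one option is the nullable "Int64" dtype. The sketch below is an assumption about what the asker might want, not part of the original answer; it rebuilds a stand-in for out from the printed values so that it runs on its own.

```python
import pandas as pd

# Stand-in for the `out` frame printed above (values copied from that output).
out = pd.DataFrame(
    {
        "number_order": [2, 3, 5, 6],
        "number_fakecouriers": [1.0, 2.0, 3.0, float("nan")],
    }
)

# Nullable integer dtype: whole numbers stay integers, the gap becomes <NA>.
out["number_fakecouriers"] = out["number_fakecouriers"].astype("Int64")
print(out)
```

With this dtype the last row holds pandas' missing-value marker <NA> rather than a Python None, which is usually the closest practical equivalent inside a DataFrame.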