Search code examples
pythonpandasdataframedatetimefillna

Fill NA values in ordered manner based on condition


CustomerID Buying_Round Date Purchase_amount$
1001 2 01/02/2020 20
1001 NaN 07/03/2020 42
1001 NaN 01/01/2020 15
1002 NaN 01/07/2020 10
1002 NaN 07/04/2020 40
1002 NaN 12/11/2020 25
1003 1 22/02/2020 30
1003 NaN 14/03/2020 52
1003 NaN 10/10/2020 45

Customer Id is the unique number of each customer based on some confidential data. This is for a grocery store which is trying to understand how a customer tends to buy based on the round he comes in and the amount he purchases to improve inventory. Buying round is the the nth time they have come to store.

What I have done is out of this jumbled info I sorted the customer Id I can also sort the date but wanted to keep the problem as original as possible. After sorting the dates it is evident that the date influences the Buying Round and I intent to keep the buying round for calculation of Round vs purchase. Now I want to fill the buying round in increasing order from 1- the number of times the unique customer occur and then start from 1 again for the next customer.

Expected Output :

CustomerID Buying_Round Date Purchase_amount$
1001 2 01/02/2020 20
1001 3 07/03/2020 42
1001 1 01/01/2020 15
1002 2 01/07/2020 10
1002 1 07/04/2020 40
1002 3 12/11/2020 25
1003 1 22/02/2020 30
1003 2 14/03/2020 52
1003 3 10/10/2020 45

Note: The 1001 is an example the original data have 1001 occur 12 times 1002 occur 4 times and 1003 occur 15 times with total 11000 unique customers for this year with no pattern or fixed values to determine how many number for each unique id, we sure have value_counts but want to work with a simpler way other than hardcoding it.


Solution

  • We can try groupby rank after converting the Date column to_datetime:

    df['Buying_Round'] = (
        pd.to_datetime(df['Date'], dayfirst=True)
            .groupby(df['CustomerID']).rank(method='dense')
            .astype(int)
    )
    

    Or with sort_values and groupby cumcount:

    df['Buying_Round'] = (
            pd.to_datetime(df['Date'], dayfirst=True)
            .sort_values()
            .groupby(df['CustomerID']).cumcount() + 1
    )
    

    Both Produce:

       CustomerID  Buying_Round        Date  Purchase_amount$
    0        1001             2  01/02/2020                20
    1        1001             3  07/03/2020                42
    2        1001             1  01/01/2020                15
    3        1002             2  01/07/2020                10
    4        1002             1  07/04/2020                40
    5        1002             3  12/11/2020                25
    6        1003             1  22/02/2020                30
    7        1003             2  14/03/2020                52
    8        1003             3  10/10/2020                45