Search code examples
pandasdatetimepandas-groupbypandas-timeindex

Pandas: cut date column into period date groups/bins


I have a dataframe as below:

df = pd.DataFrame({'Id': ['abs1', 'abs2', 'abs3', 'plo2', '201805', '201806', '202011', 'pctx1'],
                   'Date': ['2021-06-15', '2021-06-13', '2021-06-07', '2021-05-30',
                            '2021-05-12', '2021-04-28', '2021-04-15', '2021-02-01']})

I wish to bin the Date column into several groups in a new column, called Date_Bin, the rule is: from today's date, if the value in the Date is less than 7 days, then the value in the new column will be 'last 7 days', if the value is less than 14 days and more than 7 days from today, the value is '7 to 14 days', if the value is less than 30 days and more than 14 days, then the value is '14 to 30 days', same logic for 30 to 60 days, 60 to 90 days, and more than 90 days. The ideal output is like this:

       Id        Date           Date_Bin
0    abs1  2021-06-15        last 7 days
1    abs2  2021-06-13        last 7 days
2    abs3  2021-06-07       7 to 14 days
3    plo2  2021-05-30      14 to 30 days
4  201805  2021-05-10      30 to 60 days
5  201806  2021-04-28      30 to 60 days
6  202011  2021-04-15      60 to 90 days
7   pctx1  2021-02-01  more than 90 days

As you can see the output, those are the only groups/bins I need for the data. I tried a couple of ways and did not work, so much appreciate it if anyone can help please.


Solution

  • Convert your dates with to_datetime then subtract from today's normalized date (so that we remove the time part) and get the number of days. Then use pd.cut to group them appropriately.

    Anything in the future gets labeled with NaN.

    import pandas as pd
    import numpy as np
    
    df['Date'] = pd.to_datetime(df['Date'])
    s =  (pd.to_datetime('today').normalize() - df['Date']).dt.days
    
    df['Date_Bin'] = pd.cut(s, [0, 7, 14, 30, 60, 90, np.inf],
                            labels=['last 7 days', '7 to 14 days', '14 to 30 days',
                                    '30 to 60 days', '60 to 90 days', 'more than 90 days'],
                            include_lowest=True)
    

    print(df)
    
           Id       Date           Date_Bin
    0    abs1 2021-06-15        last 7 days
    1    abs2 2021-06-13        last 7 days
    2    abs3 2021-06-07       7 to 14 days
    3    plo2 2021-05-30      14 to 30 days
    4  201805 2021-05-12      30 to 60 days
    5  201806 2021-04-28      30 to 60 days
    6  202011 2021-04-15      60 to 90 days
    7   pctx1 2021-02-01  more than 90 days
    

    For future reproducibility, at the time of writing:

    pd.to_datetime('today').normalize()
    #Timestamp('2021-06-15 00:00:00')