Search code examples
pythonpandasdata-processing

Select IDs with an increment of 7 days


I am very new to python and pandas and am working on a pandas data frame which is a 3 month data and looks like:

Date     ID   product
Jul-1     1     A
Jul-1     2     B
Jul-1     3     D
Jul-4     2     F
Jul-5     1     G
Jul-9     1     K
Jul-9     2     L
Jul-9     3     X
Jul-11    1     N
Jul-13    3     M
Jul-17    1     Z
Jul-17    2     O
Jul-17    3     T

What I am trying to do is to select the ID's only at the increment of 7 days, such that it does not select any ID if it is present for the next seven days from the last date it was picked. The resulting data frame would look like this

Date     ID   product
Jul-1     1     A
Jul-1     2     B
Jul-1     3     D
Jul-9     1     K
Jul-9     2     L
Jul-9     3     X
Jul-17    1     Z
Jul-17    2     O
Jul-17    3     T

Have been trying since morning and any help would be highly appreciated

Thanks a lot in advance!!


Solution

  • Use the data from @suicidalteddy, with minor edits

    import numpy as np
    import panas as pd
    
    dat = pd.DataFrame([{'Date': '07/01/2016', 'ID': 1, 'product':'A'},
                    {'Date': '07/01/2016', 'ID': 2, 'product':'B'},
                    {'Date': '07/01/2016', 'ID': 3, 'product':'D'},
                    {'Date': '07/04/2016', 'ID': 2, 'product':'F'},
                    {'Date': '07/05/2016', 'ID': 1, 'product':'G'},
                    {'Date': '07/09/2016', 'ID': 1, 'product':'K'},
                    {'Date': '07/09/2016', 'ID': 2, 'product':'L'},
                    {'Date': '07/09/2016', 'ID': 3, 'product':'X'},
                    {'Date': '07/11/2016', 'ID': 1, 'product':'N'},
                    {'Date': '07/13/2016', 'ID': 3, 'product':'M'},
                    {'Date': '07/17/2016', 'ID': 1, 'product':'Z'},
                    {'Date': '07/17/2016', 'ID': 2, 'product':'O'},
                    {'Date': '07/17/2016', 'ID': 3, 'product':'T'},])
    dat.Date = pd.to_datetime(dat.Date, format="%m/%d/%Y")
    
    print(dat)
             Date  ID product
    0  2016-07-01   1       A
    1  2016-07-01   2       B
    2  2016-07-01   3       D
    3  2016-07-04   2       F
    4  2016-07-05   1       G
    5  2016-07-09   1       K
    6  2016-07-09   2       L
    7  2016-07-09   3       X
    8  2016-07-11   1       N
    9  2016-07-13   3       M
    10 2016-07-17   1       Z
    11 2016-07-17   2       O
    12 2016-07-17   3       T
    

    Now, use a helper function to

    1. calculate the time difference between Dates (per ID)
    2. calculate the week (0-7 days is week 1, 7-14 days is week 2, etc.)

    Then, within each group (use groupby), use these 2 helper functions to calculate the time and week. Since you require only the first entry of each week (7-days), do a second groupby and retrieve the first record.

    Here is the code to calculate the time and week as separate columns

    • code to calculate num_weeks is from this SO post
    dat = dat.sort_values(by=['ID','Date'])
    dat['time'] = pd.Series()
    dat['week'] = pd.Series([1]*len(dat))
    
    def calc_time(df):
        df.loc[:,'time'] = df['Date'] - df.loc[df.index.min(), 'Date']
        num_weeks = (
            int(np.ceil((dat['Date'].max() - \
                dat['Date'].min())/np.timedelta64(1, 'W')))
                    )
        df = calc_week(df, num_weeks) # num_weeks = 3
        return df
    
    def calc_week(df, num_weeks):
        for week in range(1, num_weeks+1):
            # print(str(week*7), str((week+1)*7))
            df.loc[\
                (df['time'] > str(week*7)+' days') & \
                (df['time'] <= str((week+1)*7)+' days'), \
            'week'] = week+1
        return df
    
    dat = dat.groupby(['ID']).apply(calc_time)
    

    Here is the output of the above step

    print(dat)
             Date  ID product    time  week
    0  2016-07-01   1       A  0 days     1
    4  2016-07-05   1       G  4 days     1
    5  2016-07-09   1       K  8 days     2
    8  2016-07-11   1       N 10 days     2
    10 2016-07-17   1       Z 16 days     3
    1  2016-07-01   2       B  0 days     1
    3  2016-07-04   2       F  3 days     1
    6  2016-07-09   2       L  8 days     2
    11 2016-07-17   2       O 16 days     3
    2  2016-07-01   3       D  0 days     1
    7  2016-07-09   3       X  8 days     2
    9  2016-07-13   3       M 12 days     2
    12 2016-07-17   3       T 16 days     3
    

    Now, the second groupby and sort to get your final desired output

    dat = dat.groupby(['ID','week']).first().reset_index(drop=False)
    dff = (
        dat[['Date','ID','product']].sort_values(by=['Date','ID'])
                                    .reset_index(drop=True)
        )
    
    print(dff)
            Date  ID product
    0 2016-07-01   1       A
    1 2016-07-01   2       B
    2 2016-07-01   3       D
    3 2016-07-09   1       K
    4 2016-07-09   2       L
    5 2016-07-09   3       X
    6 2016-07-17   1       Z
    7 2016-07-17   2       O
    8 2016-07-17   3       T
    

    Original Attempt

    Generate some data in same format as OP

    idx = pd.date_range('2018-04-01', '2018-05-01', freq='1D') + \
          pd.DateOffset(days=16)
    df = pd.DataFrame(idx, columns=['Date'])
    df.set_index('Date', inplace=True)
    df['ID'] = range(len(idx))
    df['product'] = range(12,12+len(idx))
    
    print(df)
                ID  product
    Date                   
    2018-04-17   0       12
    2018-04-18   1       13
    ...
    2018-05-16  29       41
    2018-05-17  30       42
    

    Use pd.data_range to pick out only the required intervals

    idx = pd.date_range(df.index[0], df.index[-1], freq='7D')
    df_seven_days = df.loc[idx].reset_index(drop=False)
    df_seven_days.rename(columns={'index': 'Date'}, inplace=True)
    
    print(df_seven_days)
          Date  ID  product
    2018-04-17   0       12
    2018-04-24   7       19
    2018-05-01  14       26
    2018-05-08  21       33
    2018-05-15  28       40