Search code examples
pythonpandasdataframetimedatetime-comparison

Comparing and Subtracting Dates


I'm looking to find a way to determine if a time in a column falls within 7 days of another date in that same column.

Say this is my data frame-

dic = {'firstname':['Rick','Rick','Rick','John','John','John','David',
                    'David','David','Steve','Steve','Steve','Jim','Jim',
                    'Jim'],
       'lastname':['Smith','Smith','Smith','Jones','Jones','Jones',
                   'Wilson','Wilson','Wilson','Johnson','Johnson',
                   'Johnson','Miller','Miller','Miller'],
       'company':['CFA','CFA','CFA','WND','WND','WND','INO','INO','INO',
                  'CHP','CHP','CHP','MCD','MCD','MCD'],
       'faveday':['2020-03-16','2020-03-11','2020-03-25','2020-04-30',
                  '2020-05-22','2020-05-03','2020-01-31','2020-01-13',
                  '2020-01-10','2020-10-22','2020-10-28','2020-10-22',
                  '2020-10-13','2020-10-28','2020-10-20']}
df = pd.DataFrame(dic)
df['faveday'] = pd.to_datetime(df['faveday'])
print(df)

With output-

   firstname lastname company    faveday
0       Rick    Smith     CFA 2020-03-16
1       Rick    Smith     CFA 2020-03-11
2       Rick    Smith     CFA 2020-03-25
3       John    Jones     WND 2020-04-30
4       John    Jones     WND 2020-05-22
5       John    Jones     WND 2020-05-03
6      David   Wilson     INO 2020-01-31
7      David   Wilson     INO 2020-01-13
8      David   Wilson     INO 2020-01-10
9      Steve  Johnson     CHP 2020-10-22
10     Steve  Johnson     CHP 2020-10-28
11     Steve  Johnson     CHP 2020-10-22
12       Jim   Miller     MCD 2020-10-13
13       Jim   Miller     MCD 2020-10-28
14       Jim   Miller     MCD 2020-10-20

And then I sort the data with-

df = df.sort_values(['firstname','lastname','company','faveday'])
print(df)

to get-

   firstname lastname company    faveday
8      David   Wilson     INO 2020-01-10
7      David   Wilson     INO 2020-01-13
6      David   Wilson     INO 2020-01-31
12       Jim   Miller     MCD 2020-10-13
14       Jim   Miller     MCD 2020-10-20
13       Jim   Miller     MCD 2020-10-28
3       John    Jones     WND 2020-04-30
5       John    Jones     WND 2020-05-03
4       John    Jones     WND 2020-05-22
1       Rick    Smith     CFA 2020-03-11
0       Rick    Smith     CFA 2020-03-16
2       Rick    Smith     CFA 2020-03-25
9      Steve  Johnson     CHP 2020-10-22
11     Steve  Johnson     CHP 2020-10-22
10     Steve  Johnson     CHP 2020-10-28

Say I want to know in this current order (index 8 then 7, 6, 12, etc.) whether or not a date is within 7 days of another. (So index 8 and 7 would both yield true but index 6 would not)

But I also would like to keep it grouped by name. (So index 12 and 14 would be true and 13 would not in the Jim Miller group, but index 9, 11, and 10 would all be true in the Steve Johnson group)

Is there a way to subtract dates within each group and then create a column to say TRUE or FALSE dependent upon it being within 7 days of another day?

I'm looking for an output like this-

   firstname lastname company    faveday seven_days
8      David   Wilson     INO 2020-01-10       TRUE
7      David   Wilson     INO 2020-01-13       TRUE
6      David   Wilson     INO 2020-01-31      FALSE
12       Jim   Miller     MCD 2020-10-13       TRUE
14       Jim   Miller     MCD 2020-10-20       TRUE
13       Jim   Miller     MCD 2020-10-28      FALSE
3       John    Jones     WND 2020-04-30       TRUE
5       John    Jones     WND 2020-05-03       TRUE
4       John    Jones     WND 2020-05-22      FALSE
1       Rick    Smith     CFA 2020-03-11       TRUE
0       Rick    Smith     CFA 2020-03-16       TRUE
2       Rick    Smith     CFA 2020-03-25      FALSE
9      Steve  Johnson     CHP 2020-10-22       TRUE
11     Steve  Johnson     CHP 2020-10-22       TRUE
10     Steve  Johnson     CHP 2020-10-28       TRUE

Solution

  • Let us try self define a function with numpy broadcast

    def sefd (x): 
        return np.sum((np.abs(x.values-x.values[:,None])/np.timedelta64(1, 'D'))<=7,axis=1)>=2
    s=df.groupby(['firstname', 'lastname', 'company'])['faveday'].transform(sefd)
    Out[301]: 
    0      True
    1      True
    2     False
    3      True
    4     False
    5      True
    6     False
    7      True
    8      True
    9      True
    10     True
    11     True
    12     True
    13    False
    14     True
    Name: faveday, dtype: bool
    df['seven_days']=s