I am very new to python and pandas and am working on a pandas data frame which is a 3 month data and looks like:
Date ID product
Jul-1 1 A
Jul-1 2 B
Jul-1 3 D
Jul-4 2 F
Jul-5 1 G
Jul-9 1 K
Jul-9 2 L
Jul-9 3 X
Jul-11 1 N
Jul-13 3 M
Jul-17 1 Z
Jul-17 2 O
Jul-17 3 T
What I am trying to do is to select the ID's only at the increment of 7 days, such that it does not select any ID if it is present for the next seven days from the last date it was picked. The resulting data frame would look like this
Date ID product
Jul-1 1 A
Jul-1 2 B
Jul-1 3 D
Jul-9 1 K
Jul-9 2 L
Jul-9 3 X
Jul-17 1 Z
Jul-17 2 O
Jul-17 3 T
Have been trying since morning and any help would be highly appreciated
Thanks a lot in advance!!
Use the data from @suicidalteddy, with minor edits
import numpy as np
import panas as pd
dat = pd.DataFrame([{'Date': '07/01/2016', 'ID': 1, 'product':'A'},
{'Date': '07/01/2016', 'ID': 2, 'product':'B'},
{'Date': '07/01/2016', 'ID': 3, 'product':'D'},
{'Date': '07/04/2016', 'ID': 2, 'product':'F'},
{'Date': '07/05/2016', 'ID': 1, 'product':'G'},
{'Date': '07/09/2016', 'ID': 1, 'product':'K'},
{'Date': '07/09/2016', 'ID': 2, 'product':'L'},
{'Date': '07/09/2016', 'ID': 3, 'product':'X'},
{'Date': '07/11/2016', 'ID': 1, 'product':'N'},
{'Date': '07/13/2016', 'ID': 3, 'product':'M'},
{'Date': '07/17/2016', 'ID': 1, 'product':'Z'},
{'Date': '07/17/2016', 'ID': 2, 'product':'O'},
{'Date': '07/17/2016', 'ID': 3, 'product':'T'},])
dat.Date = pd.to_datetime(dat.Date, format="%m/%d/%Y")
print(dat)
Date ID product
0 2016-07-01 1 A
1 2016-07-01 2 B
2 2016-07-01 3 D
3 2016-07-04 2 F
4 2016-07-05 1 G
5 2016-07-09 1 K
6 2016-07-09 2 L
7 2016-07-09 3 X
8 2016-07-11 1 N
9 2016-07-13 3 M
10 2016-07-17 1 Z
11 2016-07-17 2 O
12 2016-07-17 3 T
Now, use a helper function to
Date
s (per ID
)Then, within each group (use groupby
), use these 2 helper functions to calculate the time and week. Since you require only the first entry of each week (7-days), do a second groupby
and retrieve the first record.
Here is the code to calculate the time and week as separate columns
num_weeks
is from this SO postdat = dat.sort_values(by=['ID','Date'])
dat['time'] = pd.Series()
dat['week'] = pd.Series([1]*len(dat))
def calc_time(df):
df.loc[:,'time'] = df['Date'] - df.loc[df.index.min(), 'Date']
num_weeks = (
int(np.ceil((dat['Date'].max() - \
dat['Date'].min())/np.timedelta64(1, 'W')))
)
df = calc_week(df, num_weeks) # num_weeks = 3
return df
def calc_week(df, num_weeks):
for week in range(1, num_weeks+1):
# print(str(week*7), str((week+1)*7))
df.loc[\
(df['time'] > str(week*7)+' days') & \
(df['time'] <= str((week+1)*7)+' days'), \
'week'] = week+1
return df
dat = dat.groupby(['ID']).apply(calc_time)
Here is the output of the above step
print(dat)
Date ID product time week
0 2016-07-01 1 A 0 days 1
4 2016-07-05 1 G 4 days 1
5 2016-07-09 1 K 8 days 2
8 2016-07-11 1 N 10 days 2
10 2016-07-17 1 Z 16 days 3
1 2016-07-01 2 B 0 days 1
3 2016-07-04 2 F 3 days 1
6 2016-07-09 2 L 8 days 2
11 2016-07-17 2 O 16 days 3
2 2016-07-01 3 D 0 days 1
7 2016-07-09 3 X 8 days 2
9 2016-07-13 3 M 12 days 2
12 2016-07-17 3 T 16 days 3
Now, the second groupby
and sort to get your final desired output
dat = dat.groupby(['ID','week']).first().reset_index(drop=False)
dff = (
dat[['Date','ID','product']].sort_values(by=['Date','ID'])
.reset_index(drop=True)
)
print(dff)
Date ID product
0 2016-07-01 1 A
1 2016-07-01 2 B
2 2016-07-01 3 D
3 2016-07-09 1 K
4 2016-07-09 2 L
5 2016-07-09 3 X
6 2016-07-17 1 Z
7 2016-07-17 2 O
8 2016-07-17 3 T
Original Attempt
Generate some data in same format as OP
idx = pd.date_range('2018-04-01', '2018-05-01', freq='1D') + \
pd.DateOffset(days=16)
df = pd.DataFrame(idx, columns=['Date'])
df.set_index('Date', inplace=True)
df['ID'] = range(len(idx))
df['product'] = range(12,12+len(idx))
print(df)
ID product
Date
2018-04-17 0 12
2018-04-18 1 13
...
2018-05-16 29 41
2018-05-17 30 42
Use pd.data_range
to pick out only the required intervals
idx = pd.date_range(df.index[0], df.index[-1], freq='7D')
df_seven_days = df.loc[idx].reset_index(drop=False)
df_seven_days.rename(columns={'index': 'Date'}, inplace=True)
print(df_seven_days)
Date ID product
2018-04-17 0 12
2018-04-24 7 19
2018-05-01 14 26
2018-05-08 21 33
2018-05-15 28 40