I've got a dataframe with dates, and I want to select the highest date in each week excluding weekends (so Fridays, if available), unless there is no Monday to Friday data and Saturday/Sunday are the only one's available.
The sample data can be setup like this:
dates = pd.Series(data=['2018-11-05', '2018-11-06', '2018-11-07', '2018-11-08', '2018-11-09',
'2018-11-12', '2018-11-13', '2018-11-14', '2018-11-15', '2018-11-17',
'2018-11-19',
'2018-12-01',
])
nums = np.random.randint(50, 100, 12)
# nums
# array([95, 80, 81, 51, 98, 62, 50, 55, 59, 77, 69])
df = pd.DataFrame(data={'dates': dates, 'nums': nums})
df['dates'] = pd.to_datetime(df['dates'])
The records I want:
My current solution is in the answer below but I don't think it's ideal and has some issues I had to work around. Briefly, it's:
df.groupby(df['dates'].dt.week).apply(some_function)
Ideally, I'd like a way to write:
[latest Mon-Fri record] if [has Mon-Fri record] else [latest Sat-Sun record]
Create a new hierarchy of weekdays, where Saturday and Sunday are given the lowest priority. Then sort_values
on this new ranking + groupby
+ .tail(1)
.
import numpy as np
wd_map = dict(zip(np.arange(0,7,1), np.roll(np.arange(0,7,1),-2)))
# {0: 2, 1: 3, 2: 4, 3: 5, 4: 6, 5: 0, 6: 1}
df = df.assign(day_mapped = df.dates.dt.weekday.map(wd_map)).sort_values('day_mapped')
df.groupby(df.dates.dt.week).tail(1).sort_index()
dates nums day_mapped
4 2018-11-09 57 6
8 2018-11-15 83 5
10 2018-11-19 96 2
11 2018-12-01 66 0
If your data span multiple years, you'll need to group on both Year
+ week
.