I have a user-input time series dataframe whose date column supposedly consists of all consecutive Mondays and Thursdays during an extended period of time stretched over few years. Specifically, "date" column (which is the first column) has either one of these formats: "Monday mm/dd/yyyy" or "Thursday mm/dd/yyyy". (The first few rows of the series are shown as an example even though there is no typo. One can imagine that having a much much longer period of time stretched over few years, it is more likely to encounter typos)
Monday 07/31/2023
Thursday 07/27/2023
Monday 07/24/2023
Thursday 07/20/2023
Monday 07/17/2023
Friday 07/13/2023
Thursday 07/10/2023
Monday 07/06/2023
Thursday 07/03/2023
Monday 06/29/2023
I would like to check this column against the following tests as concise as possible:
You can try:
tmp = df['date'].str.split(expand=True)
# are the dates valid?
check1 = (pd.to_datetime(df['date'], format='%A %m/%d/%Y')
.dt.strftime('%A %m/%d/%Y')
.eq(df['date']).all()
)
# are the days a succession of Monday/Thursday?
check2 = tmp[0].eq(np.tile(['Monday', 'Thursday'], len(tmp)//2)[:len(tmp)]).all()
# are the time differences a succession of alternating 3/4 days?
s = pd.to_datetime(tmp[1], format='%m/%d/%Y').diff(-1).iloc[:-1]
check3 = ( (set(s.iloc[:2].dt.days) == {3, 4})
and s.iloc[2:].eq(s.shift(2).iloc[2:]).all()
)
print(check1, check2, check3)
# False False True
Used input:
df = pd.DataFrame({'date': ['Monday 07/31/2023',
'Thursday 07/27/2023',
'Monday 07/24/2023',
'Thursday 07/20/2023',
'Monday 07/17/2023',
'Friday 07/13/2023',
'Thursday 07/10/2023',
'Monday 07/06/2023',
'Thursday 07/03/2023',
'Monday 06/29/2023']})
If you want to identify the offending rows:
tmp = df['date'].str.split(expand=True)
# are the dates valid?
check1 = (pd.to_datetime(df['date'], format='%A %m/%d/%Y')
.dt.strftime('%A %m/%d/%Y')
.eq(df['date'])
)
def alternating(s, cat):
s = pd.Series(pd.Categorical(s, categories=cat).codes, index=s.index)
N = len(cat)-1
return s.eq((s+N).mod(N+1).shift(N).fillna(s)) & s.between(0, N)
# are the days a succession of Monday/Thursday?
# this flags False if the current or previous day in not alternating M/T
check2 = alternating(tmp[0], cat=['Monday', 'Thursday'])
# are the time differences a succession of alternating 3/4 days?
check3 = alternating(pd.to_datetime(tmp[1], format='%m/%d/%Y').diff(-1).dt.days.iloc[:-1],
cat=[3, 4]
).reindex(tmp.index, method='ffill')
df['valid_date?'] = check1
df['Mon/Thu?'] = check2
df['3/4 days'] = check3
df['invalid?'] = ~(check1 & check2 & check3)
print(df)
Output:
date valid_date? Mon/Thu? 3/4 days invalid?
0 Monday 07/31/2023 True True True False
1 Thursday 07/27/2023 True True True False
2 Monday 07/24/2023 True True True False
3 Thursday 07/20/2023 True True True False
4 Monday 07/17/2023 True True True False
5 Friday 07/13/2023 False False True True
6 Thursday 07/10/2023 False False True True
7 Monday 07/06/2023 False True True True
8 Thursday 07/03/2023 False True True True
9 Monday 06/29/2023 False True True True