I am trying to follow the otherwise excellent solution provided in the thread pandas-counting-and-summing-specific-conditions, but the code only ever outputs NaN values and, with sum (rather than count), gives a FutureWarning.
Basically, for each row in my df, I want to count how many of the dates in a column fall within +/- 1 day of that row's date.
If I were doing it in Excel, the following multi-conditional SUMPRODUCT or COUNTIFS would work:
= SUMPRODUCT(--(AN2>=$AN$2:$AN$35000-1),--(AN2<=$AN$2:$AN$35000+1)),
or
=countifs($AN$2:$AN$35000,">="&AN2-1,$AN$2:$AN$35000,"<="&AN2+1)
In Python, trying the approach from the linked thread, I believe the code would be:
import pandas as pd
import datetime
df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"),
                             pd.to_datetime("2020-03-05 00:00:00"),
                             pd.to_datetime("2020-03-09 00:00:00"),
                             pd.to_datetime("2020-03-10 00:00:00"),
                             pd.to_datetime("2020-03-11 00:00:00"),
                             pd.to_datetime("2020-03-12 00:00:00")]})
df["caseIntensity"] = df[(df['datet'] <= df['datet'] + datetime.timedelta(days=1)) &
                         (df['datet'] >= df['datet'] - datetime.timedelta(days=1))].sum()
The output should be: 2, 2, 2, 3, 3, 2. Instead it is all NaN!
Is it correct to assume that because I'm only testing conditions, it doesn't matter whether I sum or count? If I sum, I get a FutureWarning about invalid columns (the columns are valid), which I don't understand. But mostly, my question is: why am I only getting NaN?
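First, why the original attempt returns NaN: each comparison such as df['datet'] <= df['datet'] + datetime.timedelta(days=1) compares every date only with itself shifted by a day, never with the other dates, so the boolean mask is all True and df[mask] is simply the whole frame again. Calling .sum() on that frame returns a Series indexed by the column labels, and because a datetime column cannot be summed, pandas (depending on version) emits the FutureWarning about selecting only valid columns. Assigning that Series back to df["caseIntensity"] aligns on the row index 0, 1, 2, ..., none of which match the column labels, so every row becomes NaN. A quick check with the sample frame above:
same_row_mask = (df['datet'] <= df['datet'] + datetime.timedelta(days=1)) & \
                (df['datet'] >= df['datet'] - datetime.timedelta(days=1))
print(same_row_mask.all())            # True - each date is only tested against itself
print(df[same_row_mask].equals(df))   # True - the filter removed nothing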
Instead of looping with apply, you can use a vectorized solution: create NumPy arrays, broadcast and compare them, chain the comparisons with &, and then count the True values with sum:
a = df['datet']
b = a + pd.Timedelta(days=1)   # upper bound for each date
c = a - pd.Timedelta(days=1)   # lower bound for each date
# broadcast to an n x n boolean matrix: mask[i, j] is True when
# date j lies within [date i - 1 day, date i + 1 day]
mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])
# the per-row count of True values is the number of dates in range
df["caseIntensity"] = mask.sum(axis=1)
print(df)
datet caseIntensity
0 2020-03-04 2
1 2020-03-05 2
2 2020-03-09 2
3 2020-03-10 3
4 2020-03-11 3
5 2020-03-12 2
Here is the performance for 6k rows:
df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"),
                             pd.to_datetime("2020-03-05 00:00:00"),
                             pd.to_datetime("2020-03-09 00:00:00"),
                             pd.to_datetime("2020-03-10 00:00:00"),
                             pd.to_datetime("2020-03-11 00:00:00"),
                             pd.to_datetime("2020-03-12 00:00:00")]})
df = pd.concat([df] * 1000, ignore_index=True)
In [140]: %%timeit
...: a = df['datet']
...: b = a + pd.Timedelta(days=1)
...: c = a - pd.Timedelta(days=1)
...:
...: mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])
...:
...: df["caseIntensity"] = mask.sum(axis=1)
...:
469 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [141]: %%timeit
...: df["caseIntensity1"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)
...:
...:
6.2 s ± 368 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
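One caveat that is not part of the timings above: the broadcast mask is an n x n boolean array, so for the ~35,000 rows in the original Excel range it would need roughly 35,000 * 35,000 bytes, about 1.2 GB. If memory becomes the bottleneck, a sketch of a lighter alternative, assuming it is acceptable to count via binary search on the sorted dates (the column name caseIntensity2 is just a placeholder), could look like this:
import numpy as np
# sort the dates once, then for each date d count how many dates fall in
# [d - 1 day, d + 1 day] with two binary searches instead of an n x n mask
dates_sorted = np.sort(df['datet'].to_numpy())
lower = (df['datet'] - pd.Timedelta(days=1)).to_numpy()
upper = (df['datet'] + pd.Timedelta(days=1)).to_numpy()
df['caseIntensity2'] = (np.searchsorted(dates_sorted, upper, side='right')
                        - np.searchsorted(dates_sorted, lower, side='left'))
This gives the same counts as mask.sum(axis=1), since both bounds are inclusive, but it only ever allocates arrays of length n.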