Tags: python, excel, pandas, datetime, sumproduct

Pandas counting and summing specific conditions returns only NaN


I am trying to follow the otherwise excellent solution provided in the thread pandas-counting-and-summing-specific-conditions, but the code only ever outputs NaN values and, when summing (not counting), raises a FutureWarning.

Basically, for each row in my df, I want to count how many dates in a column are within a range of +/- 1 days of other dates in the same column.

If I were doing this in Excel, the following multi-conditional SUMPRODUCT or COUNTIFS would work:

= SUMPRODUCT(--(AN2>=$AN$2:$AN$35000-1),--(AN2<=$AN$2:$AN$35000+1)),

or

=countifs($AN$2:$AN$35000,">="&AN2-1,$AN$2:$AN$35000,"<="&AN2+1)

In Python, following the approach in the linked thread, I believe the code would be:

import pandas as pd
import datetime

df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"),
                             pd.to_datetime("2020-03-05 00:00:00"),
                             pd.to_datetime("2020-03-09 00:00:00"),
                             pd.to_datetime("2020-03-10 00:00:00"),
                             pd.to_datetime("2020-03-11 00:00:00"),
                             pd.to_datetime("2020-03-12 00:00:00")]})

df["caseIntensity"] = df[(df['datet'] <= df['datet'] + datetime.timedelta(days=1)) &
                         (df['datet'] >= df['datet'] - datetime.timedelta(days=1))].sum()

The output should be: 2, 2, 2, 3, 3, 2. Instead, it is NaN across the board!

Is it correct to assume that, because I'm testing conditions, it doesn't matter whether I sum or count? When I sum, I get a FutureWarning about invalid columns (the columns are valid), which I don't understand. But mostly, my question is: why am I only getting NaN?


Solution

  • Instead of looping with apply, it is possible to use a vectorized solution: first create NumPy arrays, compare them pairwise by broadcasting, chain both comparisons with &, and then count the True values with sum:

    a = df['datet']
    b = a + pd.Timedelta(days=1)
    c = a - pd.Timedelta(days=1)
        
    mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])
    
    df["caseIntensity"]  = mask.sum(axis=1)
    print (df)
           datet  caseIntensity
    0 2020-03-04              2
    1 2020-03-05              2
    2 2020-03-09              2
    3 2020-03-10              3
    4 2020-03-11              3
    5 2020-03-12              2
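    As for why the original attempt returns only NaN: each comparison there matches every row only against itself (shifted by a day), so the boolean mask is all True, and df[mask].sum() then reduces over columns, returning a Series indexed by the column name 'datet'. Assigning that Series to a new column aligns on the row index (0–5), where no labels match, hence NaN everywhere. A minimal sketch of the all-True mask:

    ```python
    import datetime
    import pandas as pd

    df = pd.DataFrame({'datet': pd.to_datetime(
        ["2020-03-04", "2020-03-05", "2020-03-09",
         "2020-03-10", "2020-03-11", "2020-03-12"])})

    # Each side compares a row only against ITSELF, never against other
    # rows, so every element of the mask is True:
    mask = ((df['datet'] <= df['datet'] + datetime.timedelta(days=1)) &
            (df['datet'] >= df['datet'] - datetime.timedelta(days=1)))
    print(mask.all())  # True

    # df[mask].sum() then sums over columns, yielding a Series indexed by
    # 'datet'; assigning it to df["caseIntensity"] aligns on the row index,
    # where no labels match - hence all NaN (and the FutureWarning about
    # summing a datetime column).
    ```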
    

    Here is the performance for 6k rows:

    df = pd.DataFrame({'datet': [pd.to_datetime("2020-03-04 00:00:00"),
                                 pd.to_datetime("2020-03-05 00:00:00"),
                                 pd.to_datetime("2020-03-09 00:00:00"),
                                 pd.to_datetime("2020-03-10 00:00:00"),
                                 pd.to_datetime("2020-03-11 00:00:00"),
                                 pd.to_datetime("2020-03-12 00:00:00")]})
    df = pd.concat([df] * 1000, ignore_index=True)
    
    
    In [140]: %%timeit
         ...: a = df['datet']
         ...: b = a + pd.Timedelta(days=1)
         ...: c = a - pd.Timedelta(days=1)
         ...:     
         ...: mask = (a.to_numpy() <= b.to_numpy()[:, None]) & (a.to_numpy() >= c.to_numpy()[:, None])
         ...: 
         ...: df["caseIntensity"]  = mask.sum(axis=1)
         ...: 
    469 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [141]: %%timeit
         ...: df["caseIntensity1"] = df.apply(lambda row: get_dates_in_range(df, row), axis=1)
         ...: 
         ...: 
    6.2 s ± 368 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
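
    The get_dates_in_range helper timed above is not shown in the benchmark; a plausible reconstruction (an assumption on my part, matching the row-wise logic being timed) would be:

    ```python
    import pandas as pd

    def get_dates_in_range(df, row):
        # Count rows whose date falls within +/- 1 day of this row's date.
        lo = row['datet'] - pd.Timedelta(days=1)
        hi = row['datet'] + pd.Timedelta(days=1)
        return ((df['datet'] >= lo) & (df['datet'] <= hi)).sum()

    df = pd.DataFrame({'datet': pd.to_datetime(
        ["2020-03-04", "2020-03-05", "2020-03-09",
         "2020-03-10", "2020-03-11", "2020-03-12"])})

    counts = df.apply(lambda row: get_dates_in_range(df, row), axis=1)
    print(counts.tolist())  # [2, 2, 2, 3, 3, 2]
    ```

    This calls the whole-frame comparison once per row, which is why it is roughly an order of magnitude slower than the single broadcast comparison above.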