Search code examples
pandasnumpytimedelta

Pandas dataframe filtered with np.where returns wrong data


Python 3.8.10, Pandas 1.4.4, Numpy 1.23.2

I'm receiving a data streaming from a machine every 10 seconds (in the database times are precisely stored every 10 seconds: no 9.5 nor 11). The machine goes sometimes in standby so the streaming stops. Occasionally it sends random data when in standby.

To eliminate the spurious data I use a parameter SECONDS: when set to 10, each line is considered valid if there was a stored line 10 seconds before.

import pandas as pd

df = pd.DataFrame(
    {"DT": [   # two occasional readings
        "2022-09-21 00:38:40", "2022-09-21 01:59:30",
        # stable data flow 
        "2022-09-21 04:31:40", "2022-09-21 04:31:50",
        "2022-09-21 04:32:00", "2022-09-21 04:32:10",
        "2022-09-21 04:32:20", "2022-09-21 04:32:30",
        "2022-09-21 04:32:40", "2022-09-21 04:32:50",
        # missed one reading here
        "2022-09-21 04:33:10", "2022-09-21 04:33:20", 
        "2022-09-21 04:33:30", "2022-09-21 04:33:40",
        "2022-09-21 04:33:50", "2022-09-21 04:34:00", 
        "2022-09-21 04:34:10", "2022-09-21 04:34:20",
        "2022-09-21 04:34:30"
    ], 
     "notes": [
        "occasional mistake",
        "occasional mistake",
        "first stable reading",
        "", "", "", "", "", "", "",
        "one reading missing",
        "", "", "", "", "", "", "", ""
     ] },   
)

df.DT = df.DT.astype('datetime64[ns]')

SECONDS = 10  # <====== change here the accepted interval 

df["deltaDT"] = (df["DT"] - df["DT"].shift(SECONDS // 10)).dt.total_seconds()
df["ok"] = np.where(df["deltaDT"]==SECONDS, "ok", "-")


print(df)

This is for me the most intuitive solution, and for 10 seconds it returns the correct answer: enter image description here

(Note: The reading at 2022-09-21 04.33.10 is lost, not an issue here)

Problem arise with only some SECONDS values: 10, 20, 40, 50 are OK. Choosing 30 and 60, the error appears.

This is SECONDS = 30:

enter image description here

It is obviously a problem related to binary conversion. Quite difficult to spot because if I am working with floats that really represent only integer values or NaN, there shouldn't be an infinitesimal epsilon involved.

Changing the filter condition to

df["ok"] = np.where(df["deltaDT"]<=SECONDS+EPSILON, "ok", "-")

it return the wrong result for EPSILON=1E-15, and the correct one for EPSILON=1E-14.

My question now (sorry I was longer than the Harry Potter saga):

How to convert df["deltaDT"] = (df["DT"] - df["DT"].shift(SECONDS // 10)).dt.total_seconds() to integer?

Adding .astype("Int64") generates the error "TypeError: cannot safely cast non-equivalent float64 to int64"


Solution

  • to convert your column that contains NaN to integer, it need to be nullable integer:

    df["deltaDT"] = df["deltaDT"].astype(pd.Int64Dtype())