Search code examples
pythonpandasnumpydatetimewhere-clause

Numpy Where Changing Timestamps/Datetime to Integers


Not so much a question but something puzzling me.

I have a column of dates that looks something like this:

0              NaT
1       1996-04-01
2       2000-03-01
3              NaT
4              NaT
5              NaT
6              NaT
7              NaT
8              NaT

I'd like to convert it the NaTs to a static value. (Assume I imported pandas as pd and numpy as np).

If I do:

mydata['mynewdate'] = mydata.mydate.replace(
    np.NaN, pd.datetime(1994,6,30,0,0))

All is well, I get:

0       1994-06-30
1       1996-04-01
2       2000-03-01
3       1994-06-30
4       1994-06-30
5       1994-06-30
6       1994-06-30
7       1994-06-30
8       1994-06-30

But if I do:

mydata['mynewdate'] = np.where(
    mydata['mydate'].isnull(), pd.datetime(1994,6,30,0,0),mydata['mydate'])

I get:

0        1994-06-30 00:00:00
1         828316800000000000
2         951868800000000000
3        1994-06-30 00:00:00
4        1994-06-30 00:00:00
5        1994-06-30 00:00:00
6        1994-06-30 00:00:00
7        1994-06-30 00:00:00
8        1994-06-30 00:00:00

This operation converts the original, non-null dates to integers. I thought there might be a mix-up of data types, so I did this:

mydata['mynewdate'] = np.where(
    mydata['mydate'].isnull(), pd.datetime(1994,6,30,0,0),pd.to_datetime(mydata['mydate']))

And still get:

0        1994-06-30 00:00:00
1         828316800000000000
2         951868800000000000
3        1994-06-30 00:00:00
4        1994-06-30 00:00:00
5        1994-06-30 00:00:00
6        1994-06-30 00:00:00
7        1994-06-30 00:00:00
8        1994-06-30 00:00:00

Please note (and don't ask): Yes, I have a better solution for replacing nulls. This question is not about replacing nulls (as the title indicates that it is not) but how numpy where is handling dates. I ask because I will have more complex conditions to select dates to replace in the future, and thought numpy where would do the job.

Any ideas?


Solution

  • It's due to wonky interactions between Numpy's datetime64, Pandas' Timestamp, and/or datetime.datetime. I fixed it by setting the replacement value to be a numpy.datetime64 from the start.

    static_date = np.datetime64('1994-06-30')
    # static_date = np.datetime64(pd.datetime(1994, 6, 30))
    
    mydata.assign(
        mynewdate=np.where(
            mydata.mydate.isnull(),
            static_date,
            mydata.mydate
        )
    )
    
          mydate  mynewdate
    0        NaT 1994-06-30
    1 1996-04-01 1996-04-01
    2 2000-03-01 2000-03-01
    3        NaT 1994-06-30
    4        NaT 1994-06-30
    5        NaT 1994-06-30
    6        NaT 1994-06-30
    7        NaT 1994-06-30
    8        NaT 1994-06-30