Consider a datetime-dtype Series containing some null values:
import pandas as pd
t = pd.date_range('2022-01-01', '2022-01-02', freq='4H').to_series(name='t')
t.iat[2] = None
2022-01-01 00:00:00 2022-01-01 00:00:00
2022-01-01 04:00:00 2022-01-01 04:00:00
2022-01-01 08:00:00 NaT
2022-01-01 12:00:00 2022-01-01 12:00:00
2022-01-01 16:00:00 2022-01-01 16:00:00
2022-01-01 20:00:00 2022-01-01 20:00:00
2022-01-02 00:00:00 2022-01-02 00:00:00
Freq: 4H, dtype: datetime64[ns]
The dtype of this Series is <M8[ns]. As far as I understand, this means that the underlying data is physically stored as 64-bit integers representing nanoseconds since the start of the Unix epoch (numpy.datetime64). However, I don't know how NaT is represented.
I would like to obtain an integer timestamp representing the number of nanoseconds since the start of the Unix epoch, with null values wherever there was an NaT in the data. It therefore seems like I should be able to check the precision of the timestamp data type, extract the raw integer values, and then scale up or down as needed to obtain nanoseconds.
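Both pieces of that puzzle can be inspected directly; a minimal sketch, assuming NumPy's datetime_data helper (the Series is rebuilt here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

t = pd.date_range('2022-01-01', '2022-01-02', freq='4h').to_series(name='t')
t.iat[2] = None

# datetime_data exposes the (unit, step) of a datetime64 dtype
unit, step = np.datetime_data(t.dtype)
print(unit, step)  # ns 1

# NaT is stored as the minimum 64-bit integer, i.e. np.iinfo(np.int64).min
print(int(np.datetime64('NaT').astype(np.int64)) == np.iinfo(np.int64).min)  # True
```

So for a datetime64[ns] Series the raw integers are already nanoseconds, and the only special case is the NaT sentinel.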
Expected output:
2022-01-01 00:00:00 1640995200000000000
2022-01-01 04:00:00 1641009600000000000
2022-01-01 08:00:00 <NA>
2022-01-01 12:00:00 1641038400000000000
2022-01-01 16:00:00 1641052800000000000
2022-01-01 20:00:00 1641067200000000000
2022-01-02 00:00:00 1641081600000000000
Freq: 4H, dtype: Int64
Things that did not work:
t.astype(pd.Int64Dtype()) # datetime64[ns] cannot be converted to IntegerDtype
t.astype(float).astype(pd.Int64Dtype()) # Cannot cast DatetimeArray to dtype float64
t.map(lambda t: t.timestamp()).astype(pd.Int64Dtype()) # NaTType does not support timestamp
t.map(lambda t: t.timestamp(), na_action='ignore').astype(pd.Int64Dtype()) # 'values' contains non-numeric NA NaT
Notably the .timestamp logic gets hung up on the NaT, which I don't know how to cleanly convert to the "null-aware" Int64Dtype.
Something that did work, but is very ugly:
((t - pd.Timestamp(1970, 1, 1)).dt.total_seconds() * 1e09).astype(pd.Int64Dtype())
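As an aside, besides being ugly this detour through float is lossy in general: float64 has a 53-bit mantissa while epoch nanoseconds need roughly 61 bits, so only timestamps whose integer value happens to have enough trailing zero bits (like the 4-hourly values above) survive the round-trip. A quick check with an illustrative value:

```python
import pandas as pd

# One nanosecond past midnight: the exact int64 value needs ~61 significant
# bits, more than float64's 53-bit mantissa can represent.
ts = pd.Timestamp('2022-01-01 00:00:00.000000001')
exact = ts.value                 # exact nanoseconds since the epoch
via_float = int(float(exact))    # what survives a detour through float64
print(exact, via_float, exact == via_float)  # the round-trip is lossy here
```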
Is there an idiomatic way to do this?
Edit
I don't think there is a good way to do that. Another approach:
>>> import numpy as np
>>> t.view(np.int64).astype(pd.Int64Dtype()).replace(np.iinfo(np.int64).min, pd.NA)
0 1640995200000000000
1 1641009600000000000
2 <NA>
3 1641038400000000000
4 1641052800000000000
5 1641067200000000000
6 1641081600000000000
Name: t, dtype: Int64
# Without replace(...)
>>> t.view(np.int64).astype(pd.Int64Dtype())
0 1640995200000000000
1 1641009600000000000
2 -9223372036854775808
3 1641038400000000000
4 1641052800000000000
5 1641067200000000000
6 1641081600000000000
Name: t, dtype: Int64
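A variant of the same idea that avoids hard-coding the sentinel is to record the NaT positions first and mask them afterwards; sketched here with .to_numpy().view(...) rather than Series.view, which is deprecated in recent pandas:

```python
import numpy as np
import pandas as pd

t = pd.date_range('2022-01-01', '2022-01-02', freq='4h').to_series(name='t')
t.iat[2] = None

# Reinterpret the underlying datetime64[ns] buffer as int64, then blank out
# the positions that were NaT instead of matching the sentinel by value.
na = t.isna().to_numpy()
out = pd.Series(t.to_numpy().view(np.int64), index=t.index, name=t.name)
out = out.astype(pd.Int64Dtype()).mask(na)
print(out)
```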
What you expect is unclear to me. Maybe you can use numpy:
ts = t.values.astype(float)
dti = pd.to_datetime(ts)
>>> ts
array([ 1.64099520e+18, 1.64100960e+18, -9.22337204e+18, 1.64103840e+18,
1.64105280e+18, 1.64106720e+18, 1.64108160e+18])
>>> dti
DatetimeIndex(['2022-01-01 00:00:00', '2022-01-01 04:00:00',
'NaT', '2022-01-01 12:00:00',
'2022-01-01 16:00:00', '2022-01-01 20:00:00',
'2022-01-02 00:00:00'],
dtype='datetime64[ns]', freq=None)
>>> t
0 2022-01-01 00:00:00
1 2022-01-01 04:00:00
2 NaT
3 2022-01-01 12:00:00
4 2022-01-01 16:00:00
5 2022-01-01 20:00:00
6 2022-01-02 00:00:00
Name: t, dtype: datetime64[ns]