Tags: python-3.x, pandas, dataframe, numpy-ndarray, python-datetime

Changing column datatype from Timestamp to datetime64


I have a database that I'm reading from Excel into a pandas dataframe, and the dates come in as Timestamp dtype, but I need them to be np.datetime64 so that I can do calculations with them.

I am aware that the function pd.to_datetime() and the astype('datetime64[ns]') method do work. However, for whatever reason, I am unable to update my dataframe to yield this datatype using the code mentioned above.
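
The astype variant I tried looks roughly like this (the column name 'dates' is just a stand-in):

df['dates'] = df['dates'].astype('datetime64[ns]')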

I have also tried creating an accessory dataframe from the original one, containing just the dates whose type I want to update, converting it to np.datetime64 and plugging it back into the original dataframe:

dfi = df['dates']
dfi = pd.to_datetime(dfi)
df['dates'] = dfi

But it still doesn't work. I have also tried updating the values one by one:

arr_i = df.index
for i in range(len(arr_i)):
    # convert each value and write it back into the frame
    df.at[arr_i[i], 'dates'] = df.at[arr_i[i], 'dates'].to_datetime64()

Edit: The root problem seems to be that the dtype of the column does get updated to np.datetime64, but somehow, when I pull single values out of the column, they still come back as Timestamps.
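
A minimal sketch of the behaviour I mean (with a made-up 'dates' column, not my real data):

import pandas as pd

# made-up frame standing in for my Excel data
df = pd.DataFrame({'dates': pd.date_range('2000-01-01', periods=3)})

print(df['dates'].dtype)      # datetime64[ns]  -- the column dtype looks right
print(type(df['dates'][0]))   # <class 'pandas._libs.tslibs.timestamps.Timestamp'>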

Does anyone have a suggestion for a reasonably fast workaround?


Solution

  • Pandas tries to standardize all forms of datetimes by storing them as NumPy datetime64[ns] values when you assign them to a DataFrame. But when you try to access individual datetime64 values, they are returned as Timestamps.

    There is, however, a way to prevent this automatic conversion from happening: wrap the list of values in a Series of dtype object:

    import numpy as np
    import pandas as pd
    
    # create some dates, merely for example
    dates = pd.date_range('2000-1-1', periods=10)
    # convert the dates to a *list* of datetime64s
    arr = list(dates.to_numpy())
    # wrap the values you wish to protect in a Series of dtype object.
    ser = pd.Series(arr, dtype='object')
    
    # assignment with `df['datetime64s'] = ser` would also work
    df = pd.DataFrame({'timestamps': dates,
                       'datetime64s': ser})
    
    df.info()
    # <class 'pandas.core.frame.DataFrame'>
    # RangeIndex: 10 entries, 0 to 9
    # Data columns (total 2 columns):
    # timestamps     10 non-null datetime64[ns]
    # datetime64s    10 non-null object
    # dtypes: datetime64[ns](1), object(1)
    # memory usage: 240.0+ bytes
    
    print(type(df['timestamps'][0]))
    # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    
    print(type(df['datetime64s'][0]))
    # <class 'numpy.datetime64'>
    

    But beware! Although with a little work you can circumvent Pandas' automatic conversion mechanism, it may not be wise to do so. First, converting a NumPy array to a list is usually a sign you are doing something wrong, since it is bad for performance. Second, operations on object arrays are generally much slower than equivalent operations on arrays of native NumPy dtypes.
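
    To get a feel for that cost, here is a rough, hypothetical timing sketch (the exact numbers will vary by machine; only the relative gap matters):

    import time

    import numpy as np
    import pandas as pd

    # a million dates, purely for the comparison
    dates = pd.date_range('2000-01-01', periods=1_000_000)
    arr_native = dates.to_numpy()                           # dtype datetime64[ns]
    arr_object = np.array(list(arr_native), dtype=object)   # boxed np.datetime64 values

    t0 = time.perf_counter()
    _ = arr_native - arr_native[0]        # vectorized timedelta64 arithmetic
    native_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    _ = arr_object - arr_object[0]        # element-by-element Python-level calls
    object_time = time.perf_counter() - t0

    print(f'native: {native_time:.4f}s, object: {object_time:.4f}s')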

    You may be looking at an XY problem -- it may be more fruitful to find a way to (1) work with Pandas Timestamps instead of trying to force Pandas to return NumPy datetime64s, or (2) work with datetime64 array-likes (e.g. whole Series or NumPy arrays) instead of handling values individually (which is what causes the coercion to Timestamps).
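
    As a concrete sketch of option (2), and assuming a frame with a 'dates' column like the one in the question, you can pull the whole column out as a datetime64[ns] array and do the calculations there, without ever touching individual values:

    import numpy as np
    import pandas as pd

    # hypothetical frame standing in for the Excel data in the question
    df = pd.DataFrame({'dates': pd.date_range('2000-01-01', periods=5)})

    arr = df['dates'].to_numpy()             # NumPy array, dtype datetime64[ns]
    print(arr.dtype)                         # datetime64[ns]

    # vectorized calculation: days elapsed since the first date
    elapsed_days = (arr - arr[0]) / np.timedelta64(1, 'D')
    print(elapsed_days)                      # [0. 1. 2. 3. 4.]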