Tags: python-3.x, pandas, dataframe, numpy-ndarray, python-datetime

Changing column datatype from Timestamp to datetime64


I have a database that I'm reading from Excel into a pandas dataframe, and the dates come in as Timestamp dtype, but I need them to be np.datetime64 so that I can do calculations with them.

I am aware that the function pd.to_datetime() and the astype('datetime64[ns]') method do work. However, for whatever reason, I am unable to update my dataframe to yield this datatype using the code mentioned above.
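
The astype variant I tried looks roughly like this (the column name 'dates' is just a stand-in):

df['dates'] = df['dates'].astype('datetime64[ns]')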

I have also tried creating an accessory dataframe from the original one, containing just the dates whose type I want to update, converting it to np.datetime64 and plugging it back into the original dataframe:

dfi = df['dates']
dfi = pd.to_datetime(dfi)
df['dates'] = dfi

But it still doesn't work. I have also tried updating the values one by one:

arr_i = df.index
for i in range(len(arr_i)):
    # convert each value and write it back into the frame
    df.at[arr_i[i], 'dates'] = df.at[arr_i[i], 'dates'].to_datetime64()

Edit: The root problem seems to be that the dtype of the column does get updated to np.datetime64, but somehow, when I pull single values out of the column, they still come back as Timestamps.
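
A minimal sketch of the behaviour I mean (with a made-up 'dates' column, not my real data):

import pandas as pd

# made-up frame standing in for my Excel data
df = pd.DataFrame({'dates': pd.date_range('2000-01-01', periods=3)})

print(df['dates'].dtype)      # datetime64[ns]  -- the column dtype looks right
print(type(df['dates'][0]))   # <class 'pandas._libs.tslibs.timestamps.Timestamp'>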

Does anyone have a suggestion for a reasonably fast workaround?


Solution

  • Pandas tries to standardize all forms of datetimes by storing them as NumPy datetime64[ns] values when you assign them to a DataFrame. But when you try to access individual datetime64 values, they are returned as Timestamps.

    There is, however, a way to prevent this automatic conversion from happening: wrap the list of values in a Series of dtype object:

    import numpy as np
    import pandas as pd
    
    # create some dates, merely for example
    dates = pd.date_range('2000-1-1', periods=10)
    # convert the dates to a *list* of datetime64s
    arr = list(dates.to_numpy())
    # wrap the values you wish to protect in a Series of dtype object.
    ser = pd.Series(arr, dtype='object')
    
    # assignment with `df['datetime64s'] = ser` would also work
    df = pd.DataFrame({'timestamps': dates,
                       'datetime64s': ser})
    
    df.info()
    # <class 'pandas.core.frame.DataFrame'>
    # RangeIndex: 10 entries, 0 to 9
    # Data columns (total 2 columns):
    # timestamps     10 non-null datetime64[ns]
    # datetime64s    10 non-null object
    # dtypes: datetime64[ns](1), object(1)
    # memory usage: 240.0+ bytes
    
    print(type(df['timestamps'][0]))
    # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    
    print(type(df['datetime64s'][0]))
    # <class 'numpy.datetime64'>
    

    But beware! Although with a little work you can circumvent Pandas' automatic conversion mechanism, it may not be wise to do so. First, converting a NumPy array to a list is usually a sign you are doing something wrong, since it is bad for performance. Second, operations on object arrays are generally much slower than equivalent operations on arrays of native NumPy dtypes.
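
    To get a feel for that cost, here is a rough, hypothetical timing sketch (the exact numbers will vary by machine; only the relative gap matters):

    import time

    import numpy as np
    import pandas as pd

    # a million dates, purely for the comparison
    dates = pd.date_range('2000-01-01', periods=1_000_000)
    arr_native = dates.to_numpy()                           # dtype datetime64[ns]
    arr_object = np.array(list(arr_native), dtype=object)   # boxed np.datetime64 values

    t0 = time.perf_counter()
    _ = arr_native - arr_native[0]        # vectorized timedelta64 arithmetic
    native_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    _ = arr_object - arr_object[0]        # element-by-element Python-level calls
    object_time = time.perf_counter() - t0

    print(f'native: {native_time:.4f}s, object: {object_time:.4f}s')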

    You may be looking at an XY problem -- it may be more fruitful to find a way to (1) work with Pandas Timestamps instead of trying to force Pandas to return NumPy datetime64s, or (2) work with datetime64 array-likes (e.g. whole Series or NumPy arrays) instead of handling values individually (which is what causes the coercion to Timestamps).
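
    As a concrete sketch of option (2), and assuming a frame with a 'dates' column like the one in the question, you can pull the whole column out as a datetime64[ns] array and do the calculations there, without ever touching individual values:

    import numpy as np
    import pandas as pd

    # hypothetical frame standing in for the Excel data in the question
    df = pd.DataFrame({'dates': pd.date_range('2000-01-01', periods=5)})

    arr = df['dates'].to_numpy()             # NumPy array, dtype datetime64[ns]
    print(arr.dtype)                         # datetime64[ns]

    # vectorized calculation: days elapsed since the first date
    elapsed_days = (arr - arr[0]) / np.timedelta64(1, 'D')
    print(elapsed_days)                      # [0. 1. 2. 3. 4.]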