Tags: python, pandas, datetime, categorical-data, timedelta

Pandas DataFrame: categorical dtype to datetime


I have a DataFrame df with a column "time_gap" that has a CategoricalDtype:

CategoricalDtype(categories=['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
                  '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00',
                  '0 days 06:00:00', '0 days 07:00:00', '0 days 08:00:00',
                  '0 days 09:00:00', '0 days 10:00:00', '0 days 11:00:00',
                  '0 days 12:00:00', '0 days 13:00:00', '0 days 14:00:00',
                  '0 days 15:00:00', '0 days 16:00:00', '0 days 17:00:00',
                  '0 days 18:00:00', '0 days 19:00:00', '0 days 20:00:00',
                  '0 days 21:00:00', '0 days 22:00:00', '0 days 23:00:00'],
, ordered=True)

The part after "0 days" is hours:minutes:seconds.

I would like to convert it to a datetime dtype (and ideally get rid of the "0 days").

When I try using df["time_gap"] = pd.to_datetime(df["time_gap"]), I get the following error:

TypeError: <class 'pandas._libs.tslibs.timedeltas.Timedelta'> is not convertible to datetime, at position 0

Is there an easy way to convert this categorical column to datetime?

Thank you in advance for your feedback.


Solution

  • Here are two options for accessing the hours from the categorical column:

    import pandas as pd
    
    # dummy data -->
    df = pd.DataFrame({"time_gap":['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
                      '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00',
                      '0 days 06:00:00', '0 days 07:00:00', '0 days 08:00:00',
                      '0 days 09:00:00', '0 days 10:00:00', '0 days 11:00:00',
                      '0 days 12:00:00', '0 days 13:00:00', '0 days 14:00:00',
                      '0 days 15:00:00', '0 days 16:00:00', '0 days 17:00:00',
                      '0 days 18:00:00', '0 days 19:00:00', '0 days 20:00:00',
                      '0 days 21:00:00', '0 days 22:00:00', '0 days 23:00:00']
    })
    
    df["time_gap"] = pd.to_timedelta(df["time_gap"]).astype("category")
    # <-- dummy data
    
    # via timedelta:
    df["hour"] = df["time_gap"].astype("timedelta64[ns]").dt.total_seconds()/3600
    
    # via datetime:
    df["hour_"] = (pd.Timestamp("2022-01-01") + df["time_gap"].astype("timedelta64[ns]")).dt.hour
    
    print(df)
              time_gap  hour  hour_
    0  0 days 00:00:00   0.0      0
    1  0 days 01:00:00   1.0      1
    2  0 days 02:00:00   2.0      2
    3  0 days 03:00:00   3.0      3
    4  0 days 04:00:00   4.0      4
    5  0 days 05:00:00   5.0      5
    ...
    
    # note that .dt.hour gives you an integer:
    print(df.dtypes)
    time_gap    category
    hour         float64
    hour_          int32
    dtype: object
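
If the goal is a clock-style value without the "0 days" prefix, one possible follow-up is sketched below. pd.to_datetime fails here because a timedelta is a duration rather than a point in time, so the sketch adds the gaps to an arbitrary anchor date (1900-01-01, an assumption made only for illustration) and then keeps just the time part. The column names clock and as_time and the dummy data are hypothetical, and the approach assumes every gap is shorter than 24 hours; longer gaps would roll over to the next day.

```python
import pandas as pd

# Dummy data (assumed): 24 hourly gaps stored as a categorical of Timedelta values.
gaps = pd.to_timedelta(list(range(24)), unit="h")
df = pd.DataFrame({"time_gap": pd.Series(gaps).astype("category")})

# Convert the categorical back to a plain timedelta column first.
td = df["time_gap"].astype("timedelta64[ns]")

# Anchor the durations on an arbitrary date to get real datetimes.
anchored = pd.Timestamp("1900-01-01") + td

# Option 1: formatted strings such as "01:00:00".
df["clock"] = anchored.dt.strftime("%H:%M:%S")

# Option 2: datetime.time objects.
df["as_time"] = anchored.dt.time

print(df.head(3))
```

Neither result is a datetime64 column, so if arithmetic is needed later it is probably simpler to keep the timedelta column and format it only for display.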