Search code examples
pythonpandasmachine-learningtype-conversionboosting

Using datetime64 feature type in building a model?


I have a dataframe which includes around 50 features. In my experiment I have a classification problem so I want to train the model by "GradientBoostingClassifier". The dataframe (mydata) is considered as a training set. One of those 50 features (feature20) is a date, and I need to consider this feature in my training set as well, so I have tried to convert the date to datetime64 as follows:

  mydata['feature20']=pd.to_datetime(mydata['feature20'])

Now, when I tried to train the model with the classifier, it gives me the following error:

  float() argument must be a string or a number, not 'Timestamp'

Any idea to solve this problem?


Solution

  • You can easily convert your dates to integers: df["feature20"].astype("int64") // 10**9.

    NOTE: but it's not a good idea to leave a datetime feature as it is, unless you are working with the time series. Usually you would want to extract additional information from that datetime - day of week, day of month, week of year, month #, etc


    Demo:

    In [9]: df = pd.DataFrame({'feature20':pd.date_range('2010-01-01', periods=10)})
    
    In [10]: df["new"] = df["feature20"].astype("int64") // 10**9
    
    In [11]: df
    Out[11]:
       feature20         new
    0 2010-01-01  1262304000
    1 2010-01-02  1262390400
    2 2010-01-03  1262476800
    3 2010-01-04  1262563200
    4 2010-01-05  1262649600
    5 2010-01-06  1262736000
    6 2010-01-07  1262822400
    7 2010-01-08  1262908800
    8 2010-01-09  1262995200
    9 2010-01-10  1263081600
    
    In [12]: df["date"] = pd.to_datetime(df["new"], unit="s")
    
    In [13]: df
    Out[13]:
       feature20         new       date
    0 2010-01-01  1262304000 2010-01-01
    1 2010-01-02  1262390400 2010-01-02
    2 2010-01-03  1262476800 2010-01-03
    3 2010-01-04  1262563200 2010-01-04
    4 2010-01-05  1262649600 2010-01-05
    5 2010-01-06  1262736000 2010-01-06
    6 2010-01-07  1262822400 2010-01-07
    7 2010-01-08  1262908800 2010-01-08
    8 2010-01-09  1262995200 2010-01-09
    9 2010-01-10  1263081600 2010-01-10
    

    if you have microsecond precision:

    In [28]: df = pd.DataFrame({'feature20':pd.date_range('2010-01-01 01:01:01.123456', freq="123S", periods=10)})
    
    In [29]: df
    Out[29]:
                       feature20
    0 2010-01-01 01:01:01.123456
    1 2010-01-01 01:03:04.123456
    2 2010-01-01 01:05:07.123456
    3 2010-01-01 01:07:10.123456
    4 2010-01-01 01:09:13.123456
    5 2010-01-01 01:11:16.123456
    6 2010-01-01 01:13:19.123456
    7 2010-01-01 01:15:22.123456
    8 2010-01-01 01:17:25.123456
    9 2010-01-01 01:19:28.123456
    
    In [30]: df["new"] = df["feature20"].astype("int64") // 10**3
    
    In [31]: df
    Out[31]:
                       feature20               new
    0 2010-01-01 01:01:01.123456  1262307661123456
    1 2010-01-01 01:03:04.123456  1262307784123456
    2 2010-01-01 01:05:07.123456  1262307907123456
    3 2010-01-01 01:07:10.123456  1262308030123456
    4 2010-01-01 01:09:13.123456  1262308153123456
    5 2010-01-01 01:11:16.123456  1262308276123456
    6 2010-01-01 01:13:19.123456  1262308399123456
    7 2010-01-01 01:15:22.123456  1262308522123456
    8 2010-01-01 01:17:25.123456  1262308645123456
    9 2010-01-01 01:19:28.123456  1262308768123456
    
    In [32]: df["date"] = pd.to_datetime(df["new"], unit="us")
    
    In [33]: df
    Out[33]:
                       feature20               new                       date
    0 2010-01-01 01:01:01.123456  1262307661123456 2010-01-01 01:01:01.123456
    1 2010-01-01 01:03:04.123456  1262307784123456 2010-01-01 01:03:04.123456
    2 2010-01-01 01:05:07.123456  1262307907123456 2010-01-01 01:05:07.123456
    3 2010-01-01 01:07:10.123456  1262308030123456 2010-01-01 01:07:10.123456
    4 2010-01-01 01:09:13.123456  1262308153123456 2010-01-01 01:09:13.123456
    5 2010-01-01 01:11:16.123456  1262308276123456 2010-01-01 01:11:16.123456
    6 2010-01-01 01:13:19.123456  1262308399123456 2010-01-01 01:13:19.123456
    7 2010-01-01 01:15:22.123456  1262308522123456 2010-01-01 01:15:22.123456
    8 2010-01-01 01:17:25.123456  1262308645123456 2010-01-01 01:17:25.123456
    9 2010-01-01 01:19:28.123456  1262308768123456 2010-01-01 01:19:28.123456