Search code examples
pandasscikit-learnsklearn-pandas

Custom transformer that splits dates into new column


I am following the sklearn_pandas walk through found on the sklearn_pandas README on github and am trying to modify the DateEncoder() custom transformer example to do 2 additional things:

  • Convert string type columns to datetime while taking the date format as a parameter
  • Append the original column names when spitting out the new columns. E.g: if Input Column: Date1 then Outputs: Date1_year, Date1_month, Date_1 day.

Here is my attempt (with a rather rudimentary understanding of sklearn pipelines):

import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn_pandas import DataFrameMapper

class DateEncoder(TransformerMixin):

    '''
    Specify date format using python strftime formats
    '''

    def __init__(self, date_format='%Y-%m-%d'):
        self.date_format = date_format

    def fit(self, X, y=None):
        self.dt = pd.to_datetime(X, format=self.date_format)
        return self

    def transform(self, X):
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)


data = pd.DataFrame({'dates1': ['2001-12-20','2002-10-21','2003-08-22','2004-08-23', 
                                 '2004-07-20','2007-12-21','2006-12-22','2003-04-23'],   
                     'dates2'  : ['2012-12-20','2009-10-21','2016-08-22','2017-08-23', 
                                 '2014-07-20','2011-12-21','2014-12-22','2015-04-23']})

DATE_COLS = ['dates1', 'dates2']

Mapper = DataFrameMapper([(i, DateEncoder(date_format='%Y-%m-%d')) for i in DATE_COLS], input_df=True, df_out=True)
test = Mapper.fit_transform(data)

But on runtime, I get the following error:

AttributeError: Can only use .dt accessor with datetimelike values

Why am I getting this error and how to fix it? Also any help with renaming the column names as mentioned above with the original columns (Date1_year, Date1_month, Date_1 day) would be greatly appreciated!


Solution

  • I was able to break the data format conversion and date splitter into two separate transformers and it worked.

    import pandas as pd
    from sklearn.base import TransformerMixin
    from sklearn_pandas import DataFrameMapper
    
    
    
    data2 = pd.DataFrame({'dates1': ['2001-12-20','2002-10-21','2003-08-22','2004-08-23', 
                                     '2004-07-20','2007-12-21','2006-12-22','2003-04-23'],   
                         'dates2'  : ['2012-12-20','2009-10-21','2016-08-22','2017-08-23', 
                                     '2014-07-20','2011-12-21','2014-12-22','2015-04-23']})
    
    class DateFormatter(TransformerMixin):
    
        def fit(self, X, y=None):
            # stateless transformer
            return self
    
        def transform(self, X):
            # assumes X is a DataFrame
            Xdate = X.apply(pd.to_datetime)
            return Xdate
    
    
    class DateEncoder(TransformerMixin):
    
        def fit(self, X, y=None):
            return self
    
        def transform(self, X):
            dt = X.dt
            return pd.concat([dt.year, dt.month, dt.day], axis=1)
    
    
    DATE_COLS = ['dates1', 'dates2']
    
    datemult = DataFrameMapper(
                [ (i,[DateFormatter(),DateEncoder()]) for i in DATE_COLS     ] 
                , input_df=True, df_out=True)
    
    df = datemult.fit_transform(data2)
    

    This code outputs:

    Out[4]: 
       dates1_0  dates1_1  dates1_2  dates2_0  dates2_1  dates2_2
    0      2001        12        20      2012        12        20
    1      2002        10        21      2009        10        21
    2      2003         8        22      2016         8        22
    3      2004         8        23      2017         8        23
    4      2004         7        20      2014         7        20
    5      2007        12        21      2011        12        21
    6      2006        12        22      2014        12        22
    7      2003         4        23      2015         4        23
    

    However I am still looking for a way to rename the new columns while applying the DateEncoder() transformer. E.g: dates_1_0 --> dates_1_year and dates_2_2 --> dates_2_month. I'd be happy to select that as the solution.