Tags: python, python-3.x, pipeline, attributeerror, feature-extraction

Rolling Average on a pipeline to train a model


I am having trouble fitting a model with a pipeline that first adds rolling-average columns for some features and then trains the model.

Dataframe:

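import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline

columns = ['yr', 'mnth', 'hr', 'season', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'y']
df = pd.DataFrame(np.array([
    [0, 1, 0, 1, 0, 6, 0, 1, 0.24, 2.879, 0.81, 0, 16],
    [0, 1, 1, 1, 0, 6, 0, 1, 0.22, 2.727, 0.80, 0, 40],
    [0, 1, 2, 1, 0, 6, 0, 1, 0.22, 2.727, 0.80, 0, 32],
    [0, 1, 3, 1, 0, 6, 0, 1, 0.24, 2.879, 0.75, 0, 13],
    [0, 1, 4, 1, 0, 6, 0, 1, 0.24, 2.879, 0.75, 0, 1]]), columns=columns)

X_train = df.drop(columns='y')  # drop the target column, not a row labelled 'y'
y_train = df['y']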

Rolling-average function applied to some features:

def rollingAv(Data):
          
    a=Data['atemp']
    a_shifted = a.shift(1)
    a_window = a_shifted.rolling(window=4)
    a_means = a_window.mean()
    Data['a_means'] = a_means

    h=Data['hum']
    h_shifted = h.shift(1)
    h_window = h_shifted.rolling(window=4)
    h_means = h_window.mean()
    Data['h_means'] = h_means

    w=Data['windspeed']
    w_shifted = w.shift(1)
    w_window = w_shifted.rolling(window=4)
    w_means = w_window.mean()
    Data['w_means'] = w_means

    Data=Data.dropna(subset=['a_means', 'h_means','w_means'])
    return Data.values

Rolling-average class to fit and transform in the pipeline:

class BikeRentalFeatureExtractor(BaseEstimator):
  
  def __init__(self):
    pass

  def fit(self,X, y=None):
    X=X.values
    if y.shape[0]>0:
      y=y[4:]
      return y
    else:
      pass
  
  def transform(x):
    return rollingAv(x)

Pipeline and model:

model = Pipeline(steps=[
    ("extractor", BikeRentalFeatureExtractor()),
    ("regressor", RandomForestRegressor())
    ])

parameters = {'regressor__n_estimators':[50,100,200,300]}

st = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

clf = GridSearchCV(estimator=model, param_grid=parameters)

clf.fit(X_train,y_train)

Everything runs without errors until clf.fit(X_train, y_train). The error seems to be related to the data: after getting the message below, I dropped the column it names and tried again, but the same error then appears for the next column:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-21-86937c1966f0> in <module>()
----> 1 clf.fit(X_train,y_train)

12 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/base.py in _try_aggregate_string_function(self, arg, *args, **kwargs)
    276 
    277         raise AttributeError(
--> 278             f"'{arg}' is not a valid function for '{type(self).__name__}' object"
    279         )
    280 

AttributeError: 'yr' is not a valid function for 'Series' object


Solution

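    1. fit is expected to return self. The class defines no fit_transform, so the pipeline calls fit(X, y).transform(X); here fit returns y, a pandas Series, so the call ends up as Series.transform(X), which is where the AttributeError about 'yr' comes from.
    2. transform is a method and must take self as its first parameter: def transform(self, X).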
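
The two points above can be put together roughly as follows. This is only a minimal sketch, not the asker's final code: the class name RollingMeanExtractor, its columns and window parameters, and the "_mean" column suffix are hypothetical names chosen for illustration. It also makes one assumption that goes beyond the two fixes: a transformer inside a Pipeline cannot change the number of rows (y is passed on unchanged), so instead of dropping the first rows with dropna, the sketch uses min_periods=1 and back-fills the first row.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline


class RollingMeanExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical rewrite of BikeRentalFeatureExtractor (illustration only)."""

    def __init__(self, columns=("atemp", "hum", "windspeed"), window=4):
        self.columns = columns
        self.window = window

    def fit(self, X, y=None):
        # Nothing to learn; returning self lets the pipeline chain .transform()
        return self

    def transform(self, X):
        X = X.copy()  # do not mutate the caller's DataFrame
        for col in self.columns:
            # Mean of the previous `window` rows (shifted so the current row is excluded)
            means = X[col].shift(1).rolling(self.window, min_periods=1).mean()
            # Assumption: back-fill the first row rather than dropping it,
            # so X keeps the same number of rows as y
            X[col + "_mean"] = means.bfill()
        return X.to_numpy()


# Small usage example built from the atemp, hum and windspeed columns in the question
X = pd.DataFrame({
    "atemp": [2.879, 2.727, 2.727, 2.879, 2.879],
    "hum": [0.81, 0.80, 0.80, 0.75, 0.75],
    "windspeed": [0.0, 0.0, 0.0, 0.0, 0.0],
})
y = pd.Series([16, 40, 32, 13, 1])

features = RollingMeanExtractor().transform(X)

model = Pipeline(steps=[
    ("extractor", RollingMeanExtractor()),
    ("regressor", RandomForestRegressor(n_estimators=10, random_state=0)),
])
model.fit(X, y)
predictions = model.predict(X)
```

Inheriting from TransformerMixin is optional here; it only adds fit_transform on top of fit and transform. Whether back-filling is acceptable depends on the data. If the first rows really must be discarded, that probably has to happen on X and y together before the pipeline is fitted.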