
MultiInputOutput Model RandomSearch with Scikit Pipelines


I am trying to compare different regression strategies for a forecasting problem:

  • Using algorithms that support multiple-input, multiple-output regression natively (e.g. linear regression, trees, etc.).
  • Using algorithms that need a wrapper to do multiple-input, multiple-output regression (e.g. SVR, XGBoost).
  • Using a chained regressor to exploit correlations between my targets (my forecast at t+1 is auto-correlated with the target at t+2).
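For readers who want to see the three strategies side by side, here is a minimal sketch on synthetic data. The data, the variable names, and the choice of LinearRegression and SVR as representatives are assumptions for illustration only; they are not the models or data from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.svm import SVR

# Synthetic data (illustrative only): 200 samples, 5 features, 3 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
W = rng.normal(size=(5, 3))
Y = X @ W + 0.1 * rng.normal(size=(200, 3))

# 1) Native multi-output support: the estimator accepts a 2-D y directly.
native = LinearRegression().fit(X, Y)

# 2) Wrapper: one independent SVR is fitted per target column.
wrapped = MultiOutputRegressor(SVR()).fit(X, Y)

# 3) Chain: each model after the first also gets the earlier targets as
#    extra features (true values during fit, predictions during predict).
chained = RegressorChain(SVR(), order=[0, 1, 2]).fit(X, Y)

predictions = {
    name: model.predict(X[:5])
    for name, model in [("native", native), ("wrapped", wrapped), ("chained", chained)]
}
```

All three objects expose the same fit/predict interface and return one column per target, so they can be compared with the same cross-validation code.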

The scikit-learn documentation for the multi-output wrappers is not very detailed, but it does mention:

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html

set_params(**params)
Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). 
The latter have parameters of the form <component>__<parameter> so that it’s possible to
update each component of a nested object.
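One way to check which nested names are actually valid is to list the keys returned by get_params() on the outermost object. The sketch below assumes Ridge as a stand-in for the final regressor so that it runs with scikit-learn alone; the step names match the pipeline used in this question.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Ridge is a stand-in (assumption) for the final regressor.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("variance_selector", VarianceThreshold(threshold=0.03)),
    ("estimator", Ridge()),
])
wrapper = MultiOutputRegressor(pipeline)

# Every key listed here is a valid name for set_params and for a search grid.
valid_names = sorted(wrapper.get_params(deep=True).keys())

# Keep only the names that reach the regressor's `alpha` parameter.
alpha_names = [name for name in valid_names if name.endswith("__alpha")]
print(alpha_names)
```

Each `__` in a name descends one level: the first `estimator` is the wrapper's attribute holding the pipeline, the second is the pipeline step that happens to be called 'estimator'.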

Therefore I am building my pipeline as:

pipeline_xgboost = Pipeline([('scaler', StandardScaler()),
                             ('variance_selector', VarianceThreshold(threshold=0.03)), 
                             ('estimator', xgb.XGBRegressor())])

And then creating the wrapper as:

mimo_wrapper = MultiOutputRegressor(pipeline_xgboost)

Following the documentation of scikit pipelines I am defining my xgboost parameters as:

parameters = [
    {
        'estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
        'estimator__max_depth': [10, 100, 1000],
        # etc...
    }
]

And then I am running my cross validation as:

randomized_search = RandomizedSearchCV(mimo_wrapper, parameters, random_state=0, n_iter=5,
                                       n_jobs=-1, refit=True, cv=3, verbose=True,
                                       pre_dispatch='2*n_jobs', error_score='raise', 
                                       return_train_score=True,
                                       scoring='neg_mean_absolute_error')

However, I am getting the following error:

ValueError: Invalid parameter reg_alpha for estimator Pipeline(steps=[('scaler', StandardScaler()),
                ('variance_selector', VarianceThreshold(threshold=0.03)),
                ('estimator',
                 XGBRegressor(base_score=None, booster=None,
                              colsample_bylevel=None, colsample_bynode=None,
                              colsample_bytree=None, gamma=None, gpu_id=None,
                              importance_type='gain',
                              interaction_constraints=None, learning_rate=None,
                              max_delta_step=None, max_depth=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, n_estimators=100,
                              n_jobs=None, num_parallel_tree=None,
                              random_state=None, reg_alpha=None,
                              reg_lambda=None, scale_pos_weight=None,
                              subsample=None, tree_method=None,
                              validate_parameters=None, verbosity=None))]). Check the list of available parameters with `estimator.get_params().keys()`.
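The message names the Pipeline, not XGBRegressor, because the wrapper consumes the first `estimator__` prefix and hands the remainder (`reg_alpha`) to the pipeline, which has no parameter of that name. A minimal sketch that reproduces this with scikit-learn alone, assuming Ridge and its `alpha` parameter as stand-ins for XGBRegressor and `reg_alpha`:

```python
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Ridge and `alpha` stand in (assumption) for XGBRegressor and `reg_alpha`.
wrapper = MultiOutputRegressor(
    Pipeline([("scaler", StandardScaler()), ("estimator", Ridge())])
)

try:
    # One prefix only: "estimator" resolves to the Pipeline, which has no `alpha`.
    wrapper.set_params(estimator__alpha=1.0)
    single_prefix_error = None
except ValueError as exc:
    single_prefix_error = str(exc)

# Two prefixes: wrapper -> Pipeline -> step named "estimator" -> Ridge.alpha
wrapper.set_params(estimator__estimator__alpha=1.0)
```

RandomizedSearchCV calls set_params on the object it is given, so a grid key that fails here fails in the same way inside the search.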

Did I misunderstand the scikit-learn documentation? I have also tried setting the parameters as estimator__estimator__param, since that may be the way to reach the parameters when they are inside the mimo_wrapper, but this has proved unsuccessful (example below):

parameters = {
    'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'estimator__estimator__max_depth': [10, 100, 1000]
}


random_grid = RandomizedSearchCV(estimator=pipeline_xgboost, param_distributions=parameters,random_state=0, n_iter=5,
                                       n_jobs=-1, refit=True, cv=3, verbose=True,
                                       pre_dispatch='2*n_jobs', error_score='raise', 
                                       return_train_score=True,
                                       scoring='neg_mean_absolute_error')

hyperparameters_tuning = random_grid.fit(df.drop(columns=TARGETS+UMAPS),
                              df[TARGETS])
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_11898/2539017483.py in <module>
----> 1 hyperparameters_tuning = random_grid.fit(final_file_df_with_aggregates.drop(columns=TARGETS+UMAPS),
      2                               final_file_df_with_aggregates[TARGETS])

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    889                 return results
    890 
--> 891             self._run_search(evaluate_candidates)
    892 
    893             # multimetric is determined here because in the case of a callable

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
   1764     def _run_search(self, evaluate_candidates):
   1765         """Search n_iter candidates from param_distributions"""
-> 1766         evaluate_candidates(
   1767             ParameterSampler(
   1768                 self.param_distributions, self.n_iter, random_state=self.random_state

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params, cv, more_results)
    836                     )
    837 
--> 838                 out = parallel(
    839                     delayed(_fit_and_score)(
    840                         clone(base_estimator),

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1054 
   1055             with self._backend.retrieval_context():
-> 1056                 self.retrieve()
   1057             # Make sure that we get a last message telling us we are done
   1058             elapsed_time = time.time() - self._start_time

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
    933             try:
    934                 if getattr(self._backend, 'supports_timeout', False):
--> 935                     self._output.extend(job.get(timeout=self.timeout))
    936                 else:
    937                     self._output.extend(job.get())

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

/anaconda/envs/azureml_py38/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    437                 raise CancelledError()
    438             elif self._state == FINISHED:
--> 439                 return self.__get_result()
    440             else:
    441                 raise TimeoutError()

/anaconda/envs/azureml_py38/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    386     def __get_result(self):
    387         if self._exception:
--> 388             raise self._exception
    389         else:
    390             return self._result

Funnily enough, I have noticed that setting the estimator parameters outside the random search works well:

parameters = dict({
    'estimator__max_depth': [10, 100, 1000]
})

mimo_wrapper.estimator.set_params(estimator__max_depth=200)

And as you can see, max_depth has now changed:

Pipeline(steps=[('scaler', StandardScaler()),
                ('variance_selector', VarianceThreshold(threshold=0.03)),
                ('estimator',
                 XGBRegressor(base_score=None, booster=None,
                              colsample_bylevel=None, colsample_bynode=None,
                              colsample_bytree=None, gamma=None, gpu_id=None,
                              importance_type='gain',
                              interaction_constraints=None, learning_rate=None,
                              max_delta_step=None, max_depth=200,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, n_estimators=100,
                              n_jobs=None, num_parallel_tree=None,
                              random_state=None, reg_alpha=None,
                              reg_lambda=None, scale_pos_weight=None,
                              subsample=None, tree_method=None,
                              validate_parameters=None, verbosity=None))])
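A likely reason this direct call succeeds is that it is made on mimo_wrapper.estimator, which is the Pipeline itself, so the name starts one level deeper than it would from the wrapper. The sketch below shows both entry points reaching the same parameter; Ridge and `alpha` are assumed stand-ins for XGBRegressor and `max_depth`.

```python
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Ridge and `alpha` stand in (assumption) for XGBRegressor and `max_depth`.
pipeline = Pipeline([("scaler", StandardScaler()), ("estimator", Ridge())])
wrapper = MultiOutputRegressor(pipeline)

# Called on the Pipeline: a single prefix (the step name) is enough.
wrapper.estimator.set_params(estimator__alpha=5.0)
alpha_via_pipeline = wrapper.get_params()["estimator__estimator__alpha"]

# Called on the wrapper: one extra prefix is needed to reach the Pipeline first.
wrapper.set_params(estimator__estimator__alpha=7.0)
alpha_via_wrapper = wrapper.estimator.named_steps["estimator"].alpha
```

The number of prefixes therefore depends on which object receives the call, and a hyperparameter search always calls the object passed as its estimator.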

Solution

  • Dear colleagues, it seems that this was due to a problem in XGBRegressor. In any case, the right way to define the parameters when a MultiOutputRegressor wraps a pipeline is:

    parameters = {
        'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
        'estimator__estimator__max_depth': [10, 100, 1000]
    }
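To show the double prefix working end to end, here is a minimal runnable sketch. It is not the original setup: it assumes synthetic data and uses Ridge (with `alpha`) in place of XGBRegressor so that it depends on scikit-learn only. The step names and the wrapper mirror the ones in the question.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): 120 samples, 4 features, 2 targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
Y = X @ rng.normal(size=(4, 2)) + 0.1 * rng.normal(size=(120, 2))

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("variance_selector", VarianceThreshold(threshold=0.03)),
    ("estimator", Ridge()),  # stand-in (assumption) for xgb.XGBRegressor()
])
mimo_wrapper = MultiOutputRegressor(pipeline)

# wrapper.estimator -> pipeline step "estimator" -> regressor parameter
parameters = {
    "estimator__estimator__alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
}

# The search is given the wrapper, so the grid keys are resolved from the wrapper.
search = RandomizedSearchCV(
    mimo_wrapper,
    parameters,
    n_iter=5,
    cv=3,
    random_state=0,
    scoring="neg_mean_absolute_error",
)
search.fit(X, Y)
best_alpha = search.best_params_["estimator__estimator__alpha"]
```

With XGBRegressor as the final step, the same keys would become 'estimator__estimator__reg_alpha' and 'estimator__estimator__max_depth', as in the grid above, provided the search is given mimo_wrapper rather than the bare pipeline.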