I am trying to compare different regression strategies for a forecasting problem:
The scikit-learn documentation for the multi-output wrappers is not very detailed, but it does mention the following:
https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline).
The latter have parameters of the form <component>__<parameter> so that it’s possible to
update each component of a nested object.
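To illustrate, here is a minimal sketch of that <component>__<parameter> convention (the step name 'ridge' and the Ridge estimator are just placeholders for the example):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

pipe = Pipeline([('ridge', Ridge())])
pipe.set_params(ridge__alpha=0.5)  # <component>__<parameter>: sets alpha on the 'ridge' step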
Therefore I am building my pipeline as:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

pipeline_xgboost = Pipeline([('scaler', StandardScaler()),
                             ('variance_selector', VarianceThreshold(threshold=0.03)),
                             ('estimator', xgb.XGBRegressor())])
And then creating the wrapper as:
mimo_wrapper = MultiOutputRegressor(pipeline_xgboost)
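A quick way to check which parameter names the wrapper actually exposes is to list them; every key returned here is a valid key for set_params and for a search space:

print(sorted(mimo_wrapper.get_params().keys()))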
Following the scikit-learn pipeline documentation, I am defining my XGBoost parameters as:

parameters = [
    {
        'estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
        'estimator__max_depth': [10, 100, 1000],
        # etc...
    }
]
And then I am running my cross-validation as:

randomized_search = RandomizedSearchCV(mimo_wrapper, parameters, random_state=0, n_iter=5,
                                       n_jobs=-1, refit=True, cv=3, verbose=True,
                                       pre_dispatch='2*n_jobs', error_score='raise',
                                       return_train_score=True,
                                       scoring='neg_mean_absolute_error')
However, I am getting the following error:
ValueError: Invalid parameter reg_alpha for estimator Pipeline(steps=[('scaler', StandardScaler()),
('variance_selector', VarianceThreshold(threshold=0.03)),
('estimator',
XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, gamma=None, gpu_id=None,
importance_type='gain',
interaction_constraints=None, learning_rate=None,
max_delta_step=None, max_depth=None,
min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100,
n_jobs=None, num_parallel_tree=None,
random_state=None, reg_alpha=None,
reg_lambda=None, scale_pos_weight=None,
subsample=None, tree_method=None,
validate_parameters=None, verbosity=None))]). Check the list of available parameters with `estimator.get_params().keys()`.
Did I misunderstand the scikit-learn documentation? I have also tried setting the parameters as estimator__estimator__param, as maybe this is the way to access them when they are nested inside the mimo_wrapper, but this has proved unsuccessful (example below):
parameters = {
'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
'estimator__estimator__max_depth': [10, 100, 1000]
}
random_grid = RandomizedSearchCV(estimator=pipeline_xgboost, param_distributions=parameters, random_state=0, n_iter=5,
n_jobs=-1, refit=True, cv=3, verbose=True,
pre_dispatch='2*n_jobs', error_score='raise',
return_train_score=True,
scoring='neg_mean_absolute_error')
hyperparameters_tuning = random_grid.fit(df.drop(columns=TARGETS+UMAPS),
df[TARGETS])
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
/tmp/ipykernel_11898/2539017483.py in <module>
----> 1 hyperparameters_tuning = random_grid.fit(final_file_df_with_aggregates.drop(columns=TARGETS+UMAPS),
2 final_file_df_with_aggregates[TARGETS])
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
889 return results
890
--> 891 self._run_search(evaluate_candidates)
892
893 # multimetric is determined here because in the case of a callable
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
1764 def _run_search(self, evaluate_candidates):
1765 """Search n_iter candidates from param_distributions"""
-> 1766 evaluate_candidates(
1767 ParameterSampler(
1768 self.param_distributions, self.n_iter, random_state=self.random_state
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params, cv, more_results)
836 )
837
--> 838 out = parallel(
839 delayed(_fit_and_score)(
840 clone(base_estimator),
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1054
1055 with self._backend.retrieval_context():
-> 1056 self.retrieve()
1057 # Make sure that we get a last message telling us we are done
1058 elapsed_time = time.time() - self._start_time
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
933 try:
934 if getattr(self._backend, 'supports_timeout', False):
--> 935 self._output.extend(job.get(timeout=self.timeout))
936 else:
937 self._output.extend(job.get())
/anaconda/envs/azureml_py38/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e
/anaconda/envs/azureml_py38/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
437 raise CancelledError()
438 elif self._state == FINISHED:
--> 439 return self.__get_result()
440 else:
441 raise TimeoutError()
/anaconda/envs/azureml_py38/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
386 def __get_result(self):
387 if self._exception:
--> 388 raise self._exception
389 else:
390 return self._result
Funnily enough, I have noticed that setting the estimator parameters outside of the random-search function works well:
parameters = {
    'estimator__max_depth': [10, 100, 1000]
}

mimo_wrapper.estimator.set_params(estimator__max_depth=200)
And as you can see, the max_depth is now changed:
Pipeline(steps=[('scaler', StandardScaler()),
('variance_selector', VarianceThreshold(threshold=0.03)),
('estimator',
XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, gamma=None, gpu_id=None,
importance_type='gain',
interaction_constraints=None, learning_rate=None,
max_delta_step=None, max_depth=200,
min_child_weight=None, missing=nan,
monotone_constraints=None, n_estimators=100,
n_jobs=None, num_parallel_tree=None,
random_state=None, reg_alpha=None,
reg_lambda=None, scale_pos_weight=None,
subsample=None, tree_method=None,
validate_parameters=None, verbosity=None))])
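For what it is worth, the same change can also be made through the wrapper itself rather than through its .estimator attribute (a sketch using the objects defined above):

# The first 'estimator' is MultiOutputRegressor's own parameter (the pipeline),
# the second 'estimator' is the pipeline step holding the XGBRegressor.
mimo_wrapper.set_params(estimator__estimator__max_depth=200)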
Dear colleagues, it seems that this was due to a problem in xgb.XGBRegressor; in any case, the right way of creating the parameters for a MultiOutputRegressor wrapping a pipeline is:
parameters = {
'estimator__estimator__reg_alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
'estimator__estimator__max_depth': [10, 100, 1000]
}
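Putting it together, here is a sketch of the corrected search with these parameters (same settings as in the question, but passing the wrapper instead of the bare pipeline):

randomized_search = RandomizedSearchCV(estimator=mimo_wrapper, param_distributions=parameters,
                                       random_state=0, n_iter=5,
                                       n_jobs=-1, refit=True, cv=3, verbose=True,
                                       pre_dispatch='2*n_jobs', error_score='raise',
                                       return_train_score=True,
                                       scoring='neg_mean_absolute_error')
hyperparameters_tuning = randomized_search.fit(df.drop(columns=TARGETS + UMAPS),
                                               df[TARGETS])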