I am using sklearn
modules to find the best-fitting models and model parameters. However, I run into the unexpected IndexError below:
> IndexError                                Traceback (most recent call last)
> <ipython-input-38-ea3f99e30226> in <module>
>      22         s = mean_squared_error(y[ts], best_m.predict(X[ts]))
>      23         cv[i].append(s)
> ---> 24     print(np.mean(cv, 1))
> IndexError: tuple index out of range
What I want to do is find the best-fitting regressor and its parameters, but I get the above error instead. I looked into SO
and tried this solution, but the same error still comes up. Any idea why this error happens and how to fix it?
My code:
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from xgboost.sklearn import XGBRegressor
from sklearn.datasets import make_regression
import numpy as np
import warnings

models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]

X, y = make_regression(n_samples=10000, n_features=20)

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    cv = [[] for _ in range(len(models))]
    fold = KFold(5, shuffle=False)
    for tr, ts in fold.split(X):
        for i, (model, param) in enumerate(zip(models, params)):
            best_m = GridSearchCV(model, param)
            best_m.fit(X[tr], y[tr])
            s = mean_squared_error(y[ts], best_m.predict(X[ts]))
            cv[i].append(s)
    print(np.mean(cv, 1))
Desired output:
If there is a way to fix the above error, I would like to pick the best-fitted models with their parameters and then use them for estimation. Any idea how to improve the above attempt? Thanks.
The root cause of your issue is that, while you ask for the evaluation of 6 models in GridSearchCV, you provide parameters only for the first 2:
models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso(), XGBRegressor()]
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}]
The result of enumerate(zip(models, params)) in this setting, i.e.:
for i, (model, param) in enumerate(zip(models, params)):
print((model, param))
is
(SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), {'C': [0.01, 1]})
(RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False), {'n_estimators': [10, 20]})
i.e. the last 4 models are simply ignored, so you get empty entries for them in cv:
print(cv)
# result:
[[5950.6018771284835, 5987.293514740653, 6055.368320208183, 6099.316091619069, 6146.478702335218], [3625.3243553665975, 3301.3552182952058, 3404.3321983193728, 3521.5160621260898, 3561.254684271113], [], [], [], []]
which causes the downstream error when trying to compute np.mean(cv, 1): since the sub-lists have different lengths, NumPy builds a 1-D object array instead of a 2-D array, so axis 1 is out of range.
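Here is a minimal standalone sketch of that failure mode, using a hypothetical ragged list (the exact exception depends on your NumPy version; recent releases refuse to build the ragged array and raise a ValueError instead of the IndexError above):
import numpy as np

ragged = [[1.0, 2.0], [3.0, 4.0], [], []]   # two empty sub-lists, like the cv above

# Older NumPy turns this into a 1-D object array of shape (4,), so axis=1
# does not exist; NumPy >= 1.24 refuses to build the ragged array at all.
try:
    print(np.mean(ragged, 1))
except (IndexError, ValueError) as err:
    print(type(err).__name__, err)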
The solution, as already correctly pointed out by Psi in their answer, is to use empty dictionaries for the models on which you don't actually perform any CV search; omitting XGBRegressor
(which I don't have installed), here are the results:
models = [SVR(), RandomForestRegressor(), LinearRegression(), Ridge(), Lasso()]
params2 = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}]

cv = [[] for _ in range(len(models))]
fold = KFold(5, shuffle=False)
for tr, ts in fold.split(X):
    for i, (model, param) in enumerate(zip(models, params2)):
        best_m = GridSearchCV(model, param)
        best_m.fit(X[tr], y[tr])
        s = mean_squared_error(y[ts], best_m.predict(X[ts]))
        cv[i].append(s)
where print(cv) gives:
[[4048.660483326826, 3973.984055352062, 3847.7215568088545, 3907.0566348092684, 3820.0517432992765], [1037.9378737329769, 1025.237441119364, 1016.549294695313, 993.7083268195154, 963.8115632611381], [2.2948917095935095e-26, 1.971022007799432e-26, 4.1583774042712844e-26, 2.0229469068846665e-25, 1.9295075684919642e-26], [0.0003350178681602639, 0.0003297411022124562, 0.00030834076832371557, 0.0003355298330301431, 0.00032049282437794516], [10.372789356303688, 10.137748082073076, 10.136028304131141, 10.499159069700834, 9.80779910439471]]
and print(np.mean(cv, 1)) works OK, giving:
[3.91949489e+03 1.00744890e+03 6.11665355e-26 3.25824479e-04
1.01907048e+01]
So, in your case, you should indeed change params to:
params = [{'C': [0.01, 1]}, {'n_estimators': [10, 20]}, {}, {}, {}, {}]
as already suggested by Psi.
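If you then want to pick the best-fitted model and its parameters for estimation, as described in your desired output, a minimal sketch along the following lines should work (the variable names here, e.g. final_search, are only illustrative; it simply re-runs the grid search for the winning model on the full data):
mean_scores = np.mean(cv, 1)
best_idx = int(np.argmin(mean_scores))           # model with the lowest mean MSE across folds
best_model, best_grid = models[best_idx], params[best_idx]

# refit the grid search for that model on all of the data
final_search = GridSearchCV(best_model, best_grid)
final_search.fit(X, y)

print(final_search.best_params_)      # chosen hyper-parameters
print(final_search.best_estimator_)   # fitted estimator, ready for .predict()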