I used to write my own loops to find the best parameters for my model, but that led to more coding errors, so I decided to use GridSearchCV.
I am trying to find out the best parameters for PCA for my model (the only parameter I want to grid search on).
In this model, after normalization I want to combine the original features with the PCA reduced features and then apply the linear SVM.
Then I save the whole model so I can use it to predict on new input.
I get an error on the line where I fit the data, which I need to do before I can use the best_estimator_ and best_params_ attributes.
The error says: TypeError: The score function should be a callable, all (<type 'str'>) was passed.
I did not pass any parameter to GridSearchCV that would need a string, so I am not sure why I get this error.
I also want to know whether the line print("shape after model",X.shape), just before saving my model, should print both (150, 7) and (150, 5), one for each possible parameter value?
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from numpy import array
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape) #prints (150, 4)
print(y)
# create the models and pipeline them
combined_features = FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest(k='all'))])
svm = SVC(kernel="linear")
pipeline = Pipeline([("scale", StandardScaler()),("features", combined_features), ("svm", svm)])
# Do grid search over n_components:
param_grid = dict(features__pca__n_components=[1,3])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=10)
grid_search.fit(X, y)
print("best parameters", grid_search.best_params_)
print("shape after model",X.shape) #should this print (150, 7) or (150, 5) based on best parameter?
#save the model
joblib.dump(grid_search.best_estimator_, 'model.pkl', compress = 1)
#new data to predict
Input = [2.9, 4.0, 1.2, 0.2]
Input = array(Input)
#use the saved model to predict the new data
modeltrain="model.pkl"
modeltrain_saved = joblib.load(modeltrain)
model_predictions = modeltrain_saved.predict(Input.reshape(1, -1))
print(model_predictions)
I updated the code based on the answers
You are supplying 'all' as a positional param in SelectKBest. But according to the documentation, if you want to pass 'all', you need to specify it as:
SelectKBest(k='all')
The reason is that k is a keyword argument, so it should be specified with the keyword. The first positional argument of SelectKBest is the scoring function, so when you do not specify the keyword, 'all' is taken as the scoring function, hence the error.
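The explanation above can be checked with a minimal sketch. It assumes a scikit-learn installation and uses the iris data from the question; the variable names (bad, good, raised, kept_shape) are only illustrative. The exact error message differs between scikit-learn versions, so it is printed rather than compared.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest

X, y = load_iris(return_X_y=True)

# Positional call: 'all' is bound to score_func, the first parameter.
bad = SelectKBest('all')
print("score_func is:", repr(bad.score_func))

# The constructor does not validate; the problem only shows up in fit().
try:
    bad.fit(X, y)
    raised = False
except TypeError as exc:
    raised = True
    print("fit failed:", exc)

# Keyword call: 'all' is bound to k, so every feature is kept.
good = SelectKBest(k='all')
kept_shape = good.fit_transform(X, y).shape
print("shape with k='all':", kept_shape)
```

This also shows why the traceback points at the fit line rather than at the line that builds the FeatureUnion: the bad value is only inspected when the estimator is fitted.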
Update:
Now about the shape: the original X will not be changed, so it will print (150, 4). The data is transformed on the fly, and on my PC best_params_ is n_components=1, so the final shape that goes to the SVM is (150, 5): 1 from PCA and 4 from SelectKBest.
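To see the shapes for both grid values, here is a small sketch that applies the scaling and the feature union by hand, outside of GridSearchCV. It assumes the same iris data and the same step names as the question; the shapes dictionary is only there for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Same first step as the pipeline; returns a new array, X is untouched.
X_scaled = StandardScaler().fit_transform(X)

shapes = {}
for n in (1, 3):
    union = FeatureUnion([
        ("pca", PCA(n_components=n)),
        ("univ_select", SelectKBest(k="all")),
    ])
    # Columns are concatenated: n PCA components + 4 original features.
    shapes[n] = union.fit_transform(X_scaled, y).shape

print(shapes)    # shape of what the SVM would receive for each n_components
print(X.shape)   # the original array keeps its shape
```

So (150, 5) and (150, 7) are the shapes of the intermediate arrays inside the pipeline, while printing X.shape after fitting still reports the untransformed input.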