I'm a beginner, and I have the following code:
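from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit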
pca = PCA()
model = GaussianNB()
steps = [('pca', pca), ('model', model)]
pipeline = Pipeline(steps)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
modelwithpca = GridSearchCV(pipeline, param_grid= ,cv=cv)
modelwithpca.fit(X_train,y_train)
This is a local test. What I'm trying to accomplish is:
i. Perform PCA on the dataset
ii. Use Gaussian Naive Bayes with only the default parameters
iii. Use StratifiedShuffleSplit
So in the end I want the above steps to be carried over to another function that dumps the classifier, the dataset and the feature list to test for performance.
dump_classifier_and_data(modelwithpca, dataset, features)
In the param_grid part, I don't want to test any list of parameters. I just want to have the default parameters used in Gaussian Naive Bayes if that makes sense. What do I change?
Also, should there be any changes to how I instantiate the classifier objects?
The purpose of GridSearchCV is to test with different parameters for at least one thing in your pipeline (if you don't want to test for different parameters you don't need to use GridSearchCV).
So, in general, if you want to test, say, different values of n_components for PCA, the format for using a pipeline with GridSearchCV would be the following:
gscv = GridSearchCV(pipeline, param_grid={'{step_name}__{parameter_name}': [possible values]}, cv=cv)
e.g.:
# this would perform cv for the 3 different values of n_components for pca
gscv = GridSearchCV(pipeline, param_grid={'pca__n_components': [3, 6, 10]}, cv=cv)
If you use GridSearchCV to tune PCA as above, your model (GaussianNB) is not part of the grid, so it keeps its default values.
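To make the step_name__parameter_name convention concrete, here is a minimal self-contained sketch. The synthetic dataset from make_classification and its sizes are assumptions for illustration only; substitute your own X_train and y_train.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; replace with your X_train, y_train)
X, y = make_classification(n_samples=200, n_features=12, n_informative=5,
                           random_state=42)

pipeline = Pipeline([('pca', PCA()), ('model', GaussianNB())])
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Every key accepted in param_grid appears in pipeline.get_params()
param_names = sorted(pipeline.get_params().keys())
print(param_names)

# Only PCA is tuned; GaussianNB is left out of the grid
gscv = GridSearchCV(pipeline, param_grid={'pca__n_components': [3, 6, 10]}, cv=cv)
gscv.fit(X, y)

print(gscv.best_params_)
# The model step of the refitted best pipeline still has its default parameters
print(gscv.best_estimator_.named_steps['model'].get_params())
```

Printing pipeline.get_params().keys() is a quick way to check the exact spelling of a grid key before running the search.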
If you don't need parameter tuning, then GridSearchCV is not the way to go: passing only the default parameters of your model produces a parameter grid with a single combination, so it amounts to plain cross-validation. It doesn't really make sense to do it like this, but if I have understood your question correctly, it would look as follows:
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
pca = PCA()
model = GaussianNB()
steps = [('pca', pca), ('model', model)]
pipeline = Pipeline(steps)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
# get the default parameters of your model and use them as a param_grid
modelwithpca = GridSearchCV(pipeline, param_grid={'model__' + k: [v] for k, v in model.get_params().items()}, cv=cv)
# fits once per split (5 here, as your cv is configured), then refits on all of X_train
modelwithpca.fit(X_train,y_train)
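If all you want is cross-validation with the defaults, one possible alternative is to skip GridSearchCV and score the pipeline directly with cross_val_score. This is only a sketch: the synthetic data below is an assumption standing in for your X_train and y_train, and whether your dump function accepts a plain Pipeline is something you would need to check.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; use your real training set instead)
X_train, y_train = make_classification(n_samples=200, n_features=12,
                                       random_state=42)

pipeline = Pipeline([('pca', PCA()), ('model', GaussianNB())])
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# One score per split, with PCA and GaussianNB both at their defaults
scores = cross_val_score(pipeline, X_train, y_train, cv=cv)
print(scores.mean())

# cross_val_score fits clones, so fit the pipeline itself before using it
pipeline.fit(X_train, y_train)
```

The fitted pipeline behaves like any other classifier (it has fit and predict), so it can usually be passed wherever modelwithpca would have been used.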
Hope this helps, good luck!