I have looked quite extensively on stackoverflow and elsewhere and I can't seem to find an answer to the problem below.
I am trying to modify a parameter of a function that is itself passed as a parameter inside sklearn's GridSearchCV. More specifically, I want to change parameters (here preserve_case = False) of the casual_tokenize function that is passed to the tokenizer parameter of CountVectorizer.
Here's the specific code:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from nltk import casual_tokenize
Generating dummy data from 20newsgroup
categories = ['alt.atheism', 'comp.graphics', 'sci.med',
              'soc.religion.christian']
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True,
                                  random_state=42)
Creating classification pipeline.
Note that the tokenizer can be modified using lambda. I am wondering if there's another way to do it, since it is not working with GridSearchCV.
text_clf = Pipeline([('vect',
                      CountVectorizer(tokenizer=lambda text:
                                      casual_tokenize(text,
                                                      preserve_case=False))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])
text_clf.fit(twenty_train.data, twenty_train.target) # this works fine
I then want to compare the default tokenizer of CountVectorizer with the one in nltk. Note that I am asking the question because I would like to compare more than one tokenizer, each of which has specific parameters that need to be specified.
parameters = {'vect': [CountVectorizer(),
                       CountVectorizer(tokenizer=lambda text:
                                       casual_tokenize(text,
                                                       preserve_case=False))]}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=5)
gs_clf = gs_clf.fit(twenty_train.data[:100], twenty_train.target[:100])
gs_clf.fit gives the following error: PicklingError: Can't pickle <function <lambda> at 0x1138c5598>: attribute lookup <lambda> on __main__ failed
So my questions are:
1) Does anybody know how to deal with this issue specifically with GridSearchCV?
You can use partial instead of lambda:
from functools import partial
from joblib import dump  # sklearn.externals.joblib was removed from newer scikit-learn releases

def add(a, b):
    return a + b

plus_one = partial(add, b=1)
plus_one_lambda = lambda a: a + 1

dump(plus_one, 'add.pkl')         # No problem
dump(plus_one_lambda, 'add.pkl')  # Pickling error
For your case:
tokenizer=partial(casual_tokenize, preserve_case=False)
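Since you want to compare several tokenizers, one option is to put the grid on the tokenizer parameter of the 'vect' step rather than on whole CountVectorizer instances. Below is a rough sketch of what that grid could look like. The two regex-based tokenizers are stand-ins I made up from the standard library so the snippet runs without nltk; in your code they would be things like partial(casual_tokenize, preserve_case=False).

```python
import pickle
import re
from functools import partial

# Stand-in tokenizers (assumptions for illustration only); in the question
# these would be e.g. partial(casual_tokenize, preserve_case=False).
word_tokenizer = partial(re.findall, r"\w+")
whitespace_tokenizer = partial(re.split, r"\s+")

# Grid over the `tokenizer` parameter of the pipeline step named 'vect'.
# None selects CountVectorizer's default tokenizer.
parameters = {
    "vect__tokenizer": [None, word_tokenizer, whitespace_tokenizer],
}

# Every candidate survives a pickle round trip, which is what has to
# happen when the parameters are shipped to worker processes.
for candidate in parameters["vect__tokenizer"]:
    restored = pickle.loads(pickle.dumps(candidate))
    if restored is not None:
        print(restored("Hello big world"))
```

The parameters dict would then be passed to GridSearchCV(text_clf, parameters, n_jobs=-1, cv=5) exactly as in your question.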
2) Is there a more pythonic way of passing parameters to a function that is itself passed as a parameter?
I think lambda and partial are both "pythonic" ways of doing this.
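If you ever need more than fixing a few arguments, a third option is a small class with a __call__ method that stores its options on the instance. This is only a sketch: RegexTokenizer is a name I invented, and the regex is a stand-in for whatever real tokenizer you would wrap. Instances of a class defined at the top level of a module can be pickled, unlike lambdas.

```python
import re


class RegexTokenizer:
    """Hypothetical tokenizer whose options are stored on the instance."""

    def __init__(self, pattern=r"\w+", preserve_case=True):
        self.pattern = pattern
        self.preserve_case = preserve_case

    def __call__(self, text):
        # Lowercase first when the caller asked not to preserve case.
        if not self.preserve_case:
            text = text.lower()
        return re.findall(self.pattern, text)


tokenizer = RegexTokenizer(preserve_case=False)
print(tokenizer("Hello World"))
```

An instance like this can be passed wherever a tokenizer callable is expected, for example as CountVectorizer(tokenizer=tokenizer).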
The problem here is that GridSearchCV uses multiprocessing. It may start multiple processes, so it has to serialize the parameters in one process and pass them to the others, which then deserialize them to get the same parameters.
GridSearchCV uses joblib for multiprocessing and serialization, and joblib cannot handle lambda functions.
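You can see the difference without sklearn at all, using the standard library's pickle module. This is a minimal sketch: can_pickle is a helper I wrote for the demonstration, and re.findall stands in for the real tokenizer.

```python
import pickle
import re
from functools import partial


def can_pickle(obj):
    """Return True if pickle.dumps accepts obj, False if it refuses."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# Same behaviour, two ways of binding the pattern argument.
tokenize_lambda = lambda text: re.findall(r"\w+", text)
tokenize_partial = partial(re.findall, r"\w+")

print("lambda picklable: ", can_pickle(tokenize_lambda))
print("partial picklable:", can_pickle(tokenize_partial))
```

The partial works because pickle only has to record a reference to re.findall plus the bound arguments, whereas a lambda has no importable name for pickle to look up.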