I am generating a predictive model for cancer diagnosis from a moderately large dataset (>4500 features). I have got the rfecv to work, providing me with a model that I can evaluate nicely using ROC curves, confusion matrices etc., and which is performing acceptably for classifying novel data.
please find a truncated version of my code below.
logo = LeaveOneGroupOut()
model = RFECV(LinearDiscriminantAnalysis(), step=1, cv=logo.split(X, y, groups=trial_number))
model.fit(X, y)
As I say, this works well and provides a model I'm happy with. The trouble is, I would like to be able to save this model, so that I don't need to do the lengthy retraining everytime I want to evaluate new data.
When I have tried to pickle a standard LDA or other model object, this has worked fine. When I try to pickle this RFECV object, however, I get the following error:
Traceback (most recent call last):
File "/rds/general/user/***/home/data_analysis/analysis_report_generator.py", line 56, in <module>
pickle.dump(key, file)
TypeError: cannot pickle 'generator' object
In trying to address this, I have spent a long time trying to RTFM, google extensively and dug as deep as I dared into Stack without any luck.
I would be grateful if anyone could identify what I could do to pickle this model successfully for future extraction and re-use, or whether there is an equivalent way to save the parameters of the feature-extracted LDA model for rapid analysis of new data.
This occurs because LeaveOneGroupOut().split(X, y, groups=groups)
returns a generator object—which cannot be pickled for reasons previously discussed.
To pickle it, you'd have to cast it to a finite number of splits with something like the following, or replace it with StratifiedKFold
which does not have this issue.
rfecv = RFECV(
# ...
cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
)
MRE putting all the pieces together (here I've assigned groups randomly):
import pickle
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut
from numpy.random import default_rng
rng = default_rng()
X, y = make_classification(n_samples=500, n_features=15, n_informative=3, n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1, class_sep=0.8, random_state=0)
groups = rng.integers(0, 5, size=len(y))
rfecv = RFECV(
estimator=LinearDiscriminantAnalysis(),
step=1,
cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
scoring="accuracy",
min_features_to_select=1,
n_jobs=4,
)
rfecv.fit(X, y)
with open("rfecv_lda.pickle", "wb") as fh:
pickle.dump(rfecv, fh)
Side note: A better method would be to avoid pickling the RFECV in the first place. rfecv.transform(X)
masks feature columns that the search deemed unnecessary. If you have >4500 features and only need 10, you might want to simplify your data pipeline elsewhere.