Tags: python, scikit-learn, pipeline

How to port a feature pipeline from scikit-learn V0.21 to V0.24


I am trying to port a sklearn feature pipeline that was fitted in scikit-learn V0.21 to scikit-learn V0.24. I cannot simply refit it, because I no longer have the original feature data. If I fit it on new data, the feature dimensions and positions may no longer match what the downstream model expects, since the pipeline contains a DictVectorizer.

I've tried using pickle and joblib to serialize the pipeline in V0.21 and then deserialize it in V0.24. In both cases, loading in V0.24 raised ModuleNotFoundError: No module named 'sklearn.feature_extraction.dict_vectorizer'.

I also created the pipeline with the same code in V0.21 and in V0.24. When printed, the two show some minor differences.

In V0.21

Pipeline(memory=None,
         steps=[('selector', ItemSelector(key='hsd_feature_map')),
                ('dv1',
                 DictVectorizer(dtype=<class 'numpy.float64'>, separator='=',
                                sort=True, sparse=False)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=True,
                                  use_idf=True)),
                ('max', MaxAbsScaler(copy=True))],
         verbose=False)

In V0.24

Pipeline(steps=[('selector', ItemSelector(key='hsd_feature_map')),
                ('dv1', DictVectorizer(sparse=False)),
                ('tfidf', TfidfTransformer(sublinear_tf=True)),
                ('max', MaxAbsScaler())])

I wonder if there is any way to transfer the fitted feature pipeline, or its parameters, from scikit-learn V0.21 to V0.24.


Solution

  • Starting with sklearn version 0.22.X, the module that DictVectorizer is imported from was renamed from

    sklearn/feature_extraction/dict_vectorizer.py
    

    to

    sklearn/feature_extraction/_dict_vectorizer.py
    

    I think you could override the DictVectorizer import, so that the old module path resolves to the new module before the pipeline is unpickled, according to this answer
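
One way to do that override is to register the old module name in sys.modules as an alias of the new module before loading, so that pickle can resolve the old path. This is only a minimal sketch and not necessarily what the linked answer does. The helper names alias_legacy_modules and load_legacy_pipeline and the file names are hypothetical, and the mapping contains only the one rename mentioned above.

```python
import importlib
import sys

# Old (V0.21) module path -> new (V0.24) module path.
# Only the DictVectorizer rename is taken from the error in the question;
# add further entries if loading fails on another module.
LEGACY_MODULES = {
    "sklearn.feature_extraction.dict_vectorizer":
        "sklearn.feature_extraction._dict_vectorizer",
}


def alias_legacy_modules(mapping=None):
    """Make each old module name importable as an alias of its new module."""
    if mapping is None:
        mapping = LEGACY_MODULES
    for old_name, new_name in mapping.items():
        # pickle looks modules up by name, and the import system checks
        # sys.modules first, so this entry is enough for the old path to resolve.
        sys.modules[old_name] = importlib.import_module(new_name)


def load_legacy_pipeline(path, mapping=None):
    """Load a pipeline pickled under V0.21 (hypothetical helper)."""
    import joblib  # imported here so the aliasing helper has no extra dependency

    alias_legacy_modules(mapping)
    return joblib.load(path)


# Usage under V0.24 (file names are placeholders):
#   pipeline = load_legacy_pipeline("pipeline_v021.joblib")
#   joblib.dump(pipeline, "pipeline_v024.joblib")  # re-save under the new paths
```

If loading then fails with another ModuleNotFoundError (MaxAbsScaler, for example, may be pickled under sklearn.preprocessing.data), the same kind of entry can be added to the mapping. The custom ItemSelector class also has to be importable under the module path it had when the pipeline was pickled. scikit-learn does not guarantee that pickles load correctly across versions, so it is worth comparing the transformed output of the loaded pipeline against known V0.21 output before relying on it.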