python, scikit-learn, sparse-matrix

How do I transform new input data with a dict vectorizer and use the trained model on it?


I am using a dict vectorizer to transform my categorical variables into a sparse matrix, and then training Logistic Regression and Random Forest models on it. My question is: when new data comes in later, how do I transform it into the same sparse matrix format and then use the trained model to make predictions?

Here is a sample of my code:

from sklearn.linear_model import LogisticRegressionCV

# dictVectorizeData is my own helper; it returns the sparse feature matrix and the labels
dv_x, y = dictVectorizeData(inputData, header)
# dv_x is a <740051x1112 sparse matrix of type '<type 'numpy.float64'>'
# with 9620663 stored elements in Compressed Sparse Row format>

lr_cv = LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=[10**i for i in range(-4,2)], cv=5, refit=True)
lr_cv.fit(dv_x, y)

Now suppose a new record arrives, say in this format:

{
    'banner_position': '0',
    'connspeed': 'broadband',
    'creative_format': '728x90',
    'creative_id': '4688677',
    'day_hour_etc': '1',
    'domain': 'cdn.bitmedianetwork.com',
    'exch': 'cox',
    'home_bus': 'business',
    'is_mobile': 'non-mobile',
    'os_family': 'windows',
    'os_major': '8',
    'ua_family': 'ie',
    'ua_major': '9'
}

Solution

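  • I assume that dictVectorizeData is a function you defined which calls sklearn.feature_extraction.DictVectorizer. To transform new data, you'll need access to that same fitted DictVectorizer instance, because it holds the mapping from feature names to matrix columns that was learned from the training data.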

    For example:

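    from sklearn.feature_extraction import DictVectorizer
    vec = DictVectorizer()
    # fit_transform learns the feature-to-column mapping from the training dicts
    X = vec.fit_transform(input_data)
    
    from sklearn.linear_model import LogisticRegressionCV
    lr_cv = LogisticRegressionCV()
    lr_cv.fit(X, y_input)
    
    # transform (not fit_transform) reuses the learned mapping for new data
    X_new = vec.transform(new_data)
    y_new = lr_cv.predict(X_new)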
    

    Because it gets a bit tedious to always have to transform the inputs manually, it is often easier to create a pipeline to do this automatically:

    from sklearn.pipeline import make_pipeline
    pipe = make_pipeline(DictVectorizer(), LogisticRegressionCV())
    pipe.fit(input_data, y_input)
    y_new = pipe.predict(new_data)
    

    The y_new result here is equivalent to that in the first code block.
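
To make the above concrete, here is a minimal end-to-end sketch on toy data. The records, labels, and variable names are made up for illustration and are not the asker's data. It also uses plain LogisticRegression instead of LogisticRegressionCV, only because four samples are too few for 5-fold cross-validation. It shows that a single new record is mapped onto the columns learned during fitting, and that a category value never seen in training (here os_family "linux") is silently ignored by DictVectorizer rather than raising an error.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training records (hypothetical values, for illustration only).
train_records = [
    {"connspeed": "broadband", "os_family": "windows", "is_mobile": "non-mobile"},
    {"connspeed": "dialup", "os_family": "android", "is_mobile": "mobile"},
    {"connspeed": "broadband", "os_family": "ios", "is_mobile": "mobile"},
    {"connspeed": "dialup", "os_family": "windows", "is_mobile": "non-mobile"},
]
train_labels = [1, 0, 0, 1]

# The pipeline fits the vectorizer and the classifier together.
pipe = make_pipeline(DictVectorizer(), LogisticRegression())
pipe.fit(train_records, train_labels)

# A new record; os_family "linux" did not occur in the training data.
new_record = {"connspeed": "broadband", "os_family": "linux", "is_mobile": "non-mobile"}

# Inspect what the fitted vectorizer does with the new record.
vec = pipe.named_steps["dictvectorizer"]
X_new = vec.transform([new_record])
n_train_columns = len(vec.feature_names_)

# Predict directly from the raw dict; the pipeline applies the transform itself.
prediction = pipe.predict([new_record])

print(X_new.shape, n_train_columns, X_new.nnz)
print(prediction)
```

X_new has exactly as many columns as the training matrix, so the trained model can consume it directly. Only the two known feature values end up as non-zero entries; the unseen "linux" value contributes nothing.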