I am using DictVectorizer to transform my categorical variables into a sparse matrix, and then training models with Logistic Regression and Random Forest. My question is: the next time new data comes in, how do I transform it into the same sparse matrix format and use the trained model to make a prediction?
Here is a sample of my code:
dv_x, y = dictVectorizeData(inputData, header)
# dv_x is a <740051x1112 sparse matrix of type '<type 'numpy.float64'>'
# with 9620663 stored elements in Compressed Sparse Row format>
lr_cv = LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=[10**i for i in range(-4,2)], cv=5, refit=True)
lr_cv.fit(dv_x, y)
Now suppose new data arrives in this format:
{
'banner_position': '0',
'connspeed': 'broadband',
'creative_format': '728x90',
'creative_id': '4688677',
'day_hour_etc': '1',
'domain': 'cdn.bitmedianetwork.com',
'exch': 'cox',
'home_bus': 'business',
'is_mobile': 'non-mobile',
'os_family': 'windows',
'os_major': '8',
'ua_family': 'ie',
'ua_major': '9'
}
I assume that dictVectorizeData is a function you defined which calls sklearn.feature_extraction.DictVectorizer. To transform new data, you'll need access to that fitted DictVectorizer instance.
For example:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Fit the vectorizer on the training dicts and keep the instance around
vec = DictVectorizer()
X = vec.fit_transform(input_data)

lr_cv = LogisticRegressionCV()
lr_cv.fit(X, y_input)

# Use transform (not fit_transform) so the new data is mapped onto the
# same feature columns that were learned during training
X_new = vec.transform(new_data)
y_new = lr_cv.predict(X_new)
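To see concretely how transform behaves on new data, here is a small self-contained sketch (the toy dicts and feature values are made up for illustration, not taken from your data). Note that a category value never seen during fit has no learned column, so DictVectorizer silently drops it:

```python
from sklearn.feature_extraction import DictVectorizer

# Toy training dicts (hypothetical values)
train = [
    {'os_family': 'windows', 'ua_family': 'ie'},
    {'os_family': 'linux', 'ua_family': 'firefox'},
]
vec = DictVectorizer()
X_train = vec.fit_transform(train)  # learns one column per (feature, value) pair

# 'safari' was never seen during fit, so it has no column and is dropped;
# 'windows' maps to the same column it had during training.
new = [{'os_family': 'windows', 'ua_family': 'safari'}]
X_new = vec.transform(new)

print(X_train.shape)  # (2, 4): two values each for os_family and ua_family
print(X_new.shape)    # (1, 4): same columns as training
```

Because the column layout is fixed at fit time, the trained model can accept X_new directly.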
Because it gets a bit tedious to always have to transform the inputs manually, it is often easier to create a pipeline to do this automatically:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(DictVectorizer(), LogisticRegressionCV())
pipe.fit(input_data, y_input)
y_new = pipe.predict(new_data)
The y_new result here is equivalent to that in the first code block.
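Since your actual question is about scoring data that arrives later, one common approach is to persist the fitted pipeline with joblib and reload it in the scoring process. A minimal sketch, assuming toy stand-ins for input_data and y_input (the filename is arbitrary):

```python
import joblib
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for input_data / y_input (hypothetical values)
input_data = [{'os_family': 'windows'}, {'os_family': 'linux'}] * 10
y_input = [0, 1] * 10

pipe = make_pipeline(DictVectorizer(), LogisticRegression())
pipe.fit(input_data, y_input)

# Save the fitted pipeline: vectorizer and model travel together
joblib.dump(pipe, 'model.joblib')

# Later, possibly in a different process: reload and predict on new dicts
pipe = joblib.load('model.joblib')
print(pipe.predict([{'os_family': 'windows'}]))
```

Persisting the whole pipeline, rather than the model alone, guarantees that new data is always vectorized with exactly the columns learned at training time.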