Tags: python, scikit-learn, categorical-data, dictvectorizer

Why would DictVectorizer change the number of features?


I have a dataset of 324 rows and 35 columns. I split it into training and testing data:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tempCSV[feature_names[0:34]], tempCSV[feature_names[34]], test_size=0.2, random_state=32)

This seems to work fine, and my X_train and X_test both have 34 features. I apply some further transformations with DictVectorizer because I have categorical variables.

from sklearn.feature_extraction import DictVectorizer
vecS = DictVectorizer(sparse=False)
X_train = vecS.fit_transform(X_train.to_dict(orient='records'))
X_test = vecS.fit_transform(X_test.to_dict(orient='records'))

Now when I compare X_train to X_test, the former has 46 features, and the latter only has 44. What are some possible reasons this could happen?


Solution

  • Because you are vectorizing with two different fits. When you use fit_transform on both splits:

    X_train = vecS.fit_transform(X_train.to_dict(orient='records'))
    X_test = vecS.fit_transform(X_test.to_dict(orient='records'))
    

    that results in two differently fitted vectorizers acting on your data sets: the first is fitted on all the features found in X_train.to_dict, the second on all the features found in X_test.to_dict. DictVectorizer creates one binary column per distinct value of each categorical variable it sees during fit, so when some category values appear in only one of the splits, the two fits produce different numbers of columns (here 46 vs. 44). You want to fit the vectorizer once on your training data and then use only transform on the test data, because fit_transform refits:

    X_train = vecS.fit_transform(X_train.to_dict(orient='records'))
    X_test = vecS.transform(X_test.to_dict(orient='records'))
    

    Note that your model will only ever know about the features seen in your training set; category values that appear only in the test set are silently ignored by transform.
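
    To make this concrete, here is a minimal, self-contained sketch (the toy data and the column names color and size are invented for illustration) showing the two behaviours side by side:

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    # Toy data: the value 'blue' occurs only in the training split.
    train = pd.DataFrame({'color': ['red', 'blue', 'green'], 'size': [1, 2, 3]})
    test = pd.DataFrame({'color': ['red', 'green'], 'size': [4, 5]})

    # Two independent fits produce two different feature spaces.
    v1, v2 = DictVectorizer(sparse=False), DictVectorizer(sparse=False)
    print(v1.fit_transform(train.to_dict(orient='records')).shape)  # (3, 4)
    print(v2.fit_transform(test.to_dict(orient='records')).shape)   # (2, 3)

    # Fit once on the training data, then only transform the test data.
    vec = DictVectorizer(sparse=False)
    X_train = vec.fit_transform(train.to_dict(orient='records'))
    X_test = vec.transform(test.to_dict(orient='records'))
    print(X_train.shape, X_test.shape)  # (3, 4) (2, 4) -- same width now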