Tags: scikit-learn, nlp, classification, random-forest, text-classification

How to shrink a bag-of-words model?


The question title says it all: How can I make a bag-of-words model smaller? I use a Random Forest and a bag-of-words feature set. My model reaches 30 GB in size and I am sure that most words in the feature set do not contribute to the overall performance.

How to shrink a big bag-of-words model without losing (too much) performance?


Solution

  • Use feature selection. Feature selection removes features from your dataset based on how their distribution relates to your labels, using some scoring function.

    Features that rarely occur, or that occur roughly equally often across all your labels, are very unlikely to contribute to accurate classification, and so get low scores.

    Here's an example using sklearn:

    from sklearn.feature_selection import SelectPercentile
    
    # Assume some feature matrix X and labels y
    # percentile=10 means only the 10% highest-scoring features are kept
    # (the default scoring function is f_classif)
    selector = SelectPercentile(percentile=10)
    
    # A feature space with only 10% of the features
    X_new = selector.fit_transform(X, y)
    
    # See the scores for all features
    selector.scores_
    

    As always, be sure to call fit_transform only on your training data. On dev or test data, use transform only, so that the selector fitted on the training set is reused. See the scikit-learn feature selection documentation for additional details.

    Note that there is also a SelectKBest, which does the same but lets you specify an absolute number of features to keep instead of a percentage.
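
    Putting this together, here is a small end-to-end sketch (the corpus and labels are made up for illustration) that builds a bag-of-words matrix, keeps the top-k features with SelectKBest and the chi2 score function (well suited to sparse count data), and applies the fitted selector to held-out data with transform only:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    
    # Hypothetical tiny corpus, for illustration only
    train_docs = ["good movie", "great film", "bad movie", "terrible film"]
    y_train = [1, 1, 0, 0]
    test_docs = ["great movie", "bad film"]
    
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_docs)  # sparse bag-of-words matrix
    
    # Keep only the 2 highest-scoring features
    selector = SelectKBest(score_func=chi2, k=2)
    X_train_small = selector.fit_transform(X_train, y_train)
    
    # Held-out data: reuse the fitted vectorizer and selector, transform only
    X_test_small = selector.transform(vectorizer.transform(test_docs))
    
    print(X_train_small.shape)  # (4, 2)
    print(selector.scores_)     # one chi2 score per original feature

    Words like "movie" and "film" appear equally often in both classes, so they score near zero and are dropped, while class-specific words like "good" and "terrible" score higher and survive.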