machine-learning scikit-learn feature-extraction categorical-data

Encode a categorical feature with multiple categories per example

I am working on a dataset in which one feature holds multiple categories per example. The feature looks like this:

                              Feature
0   [Category1, Category2, Category2, Category4, Category5]
1                     [Category11, Category20, Category133]
2                                    [Category2, Category9]
3                [Category1000, Category1200, Category2000]
4                                              [Category12]

The problem is similar to this earlier question: Encode categorical features with multiple categories per example - sklearn

Now, I want to vectorize this feature. One solution is to use MultiLabelBinarizer, as suggested in the answer to that question. But there are around 2000 categories, which results in sparse and very high-dimensional encoded data.

Is there any other encoding that can be used, or any other way to approach this problem? Thanks.

Solution

Given a very sparse, high-dimensional array, one could use a dimensionality reduction technique such as PCA (principal component analysis) to project the feature space onto the top k principal components, i.e. the k directions that capture the most variance.

Assume X is the output of MultiLabelBinarizer, with roughly 2000 columns:
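
For concreteness, X could be built along the following lines. This is only a minimal sketch using the five example rows from the question; the names `rows` and `mlb` are illustrative and not part of the original post.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Example rows copied from the question (illustrative data only).
rows = [
    ["Category1", "Category2", "Category2", "Category4", "Category5"],
    ["Category11", "Category20", "Category133"],
    ["Category2", "Category9"],
    ["Category1000", "Category1200", "Category2000"],
    ["Category12"],
]

# One binary column per distinct category; duplicates within a row count once.
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(rows)

print(X.shape)  # (5, 12): 5 examples, 12 distinct categories
```

On the real dataset the second dimension would be around 2000 rather than 12.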

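from sklearn.decomposition import PCA

k = 5  # number of principal components to keep
model = PCA(n_components=k, random_state=666)
model.fit(X)
# PCA has no predict(); transform() projects X onto the k components
components = model.transform(X)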

You can then use the top k components as a lower-dimensional feature space that, ideally, explains a large portion of the variance in the original feature space.

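To check how well the smaller feature space describes the variance, inspect the fitted model's attributes. explained_variance_ gives the variance captured by each component, and explained_variance_ratio_ gives the same as a fraction of the total variance:

model.explained_variance_
model.explained_variance_ratio_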
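
Putting the steps together, an end-to-end sketch might look like the one below. It reuses the five example rows from the question, so k is set to 2 here only because PCA cannot return more components than there are samples; the names `rows`, `reduced`, `ratio`, and `cumulative` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MultiLabelBinarizer

# Example rows copied from the question (illustrative data only).
rows = [
    ["Category1", "Category2", "Category2", "Category4", "Category5"],
    ["Category11", "Category20", "Category133"],
    ["Category2", "Category9"],
    ["Category1000", "Category1200", "Category2000"],
    ["Category12"],
]

# Step 1: multi-hot encode the category lists.
X = MultiLabelBinarizer().fit_transform(rows)

# Step 2: project onto the top k principal components.
k = 2  # assumption for this toy data; tune on the real dataset
model = PCA(n_components=k, random_state=666)
reduced = model.fit_transform(X)

# Step 3: check how much variance the k components retain.
ratio = model.explained_variance_ratio_
cumulative = np.cumsum(ratio)

print(reduced.shape)
print(ratio)
print(cumulative)
```

The last entry of `cumulative` is the total fraction of variance retained; on the real data one would typically increase k until that value is acceptable.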