Tags: python, pandas, numpy, machine-learning, decision-tree

How to filter non-zero importance features from sparse matrix?


I have a dataset where most of the columns contain text values, so I used TF-IDF and count vectorizers to convert it into vector form. As a result, I got a sparse matrix. I applied a decision tree algorithm and got the expected results. Now I want to build another model that uses only the features with non-zero feature importance, but I have not been able to select those features.

X_tr
<65548x3101 sparse matrix of type '<class 'numpy.float64'>'
    with 7713590 stored elements in Compressed Sparse Row format>

Here, X_tr is my training dataset.

X_tr.shape
(65548, 3101)

dtc.feature_importances_.shape
(3101,)

Here, 'dtc' is my decision tree classifier model.

My question is: how can I get another sparse matrix that contains only those features whose feature importance is non-zero?


Solution

  • I think this should be as simple as:

    X_tr[:, dtc.feature_importances_ != 0]
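To make the one-liner above easier to try out, here is a minimal, self-contained sketch. The toy data, the labels `y`, and the names `X_dense`, `keep`, and `X_tr_reduced` are made up for illustration and are not from the question; only `X_tr` and `dtc` mirror the original names. It converts the boolean mask to integer column indices with `np.flatnonzero`, which should select the same columns as the mask but also lets you reuse the indices later.

```python
import numpy as np
from scipy import sparse
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the vectorized training data (assumption: not the real dataset).
rng = np.random.default_rng(0)
X_dense = rng.random((200, 20))
X_dense[X_dense < 0.7] = 0.0                      # make the matrix mostly zeros
y = (X_dense[:, 3] + X_dense[:, 7] > 0.8).astype(int)

X_tr = sparse.csr_matrix(X_dense)                 # CSR format, as in the question

# Fit the tree on the full sparse matrix.
dtc = DecisionTreeClassifier(random_state=0).fit(X_tr, y)

# Integer indices of the columns whose importance is non-zero.
keep = np.flatnonzero(dtc.feature_importances_ != 0)

# Column-slice the sparse matrix; the result is still sparse.
X_tr_reduced = X_tr[:, keep]

print(X_tr.shape, "->", X_tr_reduced.shape)
```

If you also have a test matrix built with the same vectorizers, index it with the same `keep` array so that the columns of the training and test matrices stay aligned before fitting the second model.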