Memory error for a decision tree with 66k features, using scikit-learn

Problem Statement

I am working with a document of 1,600,000 lines and ~66k features, using the bag-of-words approach to build a decision tree. The following code works fine for a 1,000-line document but throws a memory error for the actual 1,600,000-line document. My server has 64 GB of RAM.

Instead of using .todense() or .toarray(), is there any way to use the sparse matrix directly? Or is there an option to reduce the default dtype from float64? Kindly help me with this.


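from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer

# corpus holds the documents; corpus2 is passed to fit() as the labels (y)
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(corpus)  # sparse TF-IDF matrix

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train.todense(), corpus2)  # densifying here triggers the memory error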


Traceback (most recent call last):
  File "", line 103, in <module>
    clf = clf.fit(X_train.todense(), corpus2)
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/", line 458, in todense
    return np.asmatrix(self.toarray())
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/", line 550, in toarray
    return self.tocoo(copy=False).toarray()
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/", line 219, in toarray
    B = np.zeros(self.shape, dtype=self.dtype)

In short, is there any method for using a classification tree on a large data set with 66k features?
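
The last frame of the traceback allocates a dense array of the full matrix shape with np.zeros. A rough size estimate, assuming the shape is exactly 1,600,000 x 66,000 (the question only gives approximate figures) and ignoring any extra copies, suggests why 64 GB is not enough, and why switching a dense array to float32 would not be enough on its own:

```python
import numpy as np

# Approximate shape taken from the question (assumption: exact values may differ).
n_rows = 1600000
n_cols = 66000

def dense_size_gb(rows, cols, dtype):
    """Size in GB (10**9 bytes) of a dense rows x cols array of the given dtype."""
    return rows * cols * np.dtype(dtype).itemsize / 1e9

float64_gb = dense_size_gb(n_rows, n_cols, np.float64)
float32_gb = dense_size_gb(n_rows, n_cols, np.float32)

print("dense float64: about %.0f GB" % float64_gb)
print("dense float32: about %.0f GB" % float32_gb)
```

Both figures are far above 64 GB, so the matrix has to stay sparse whatever dtype is chosen.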


  • Add dtype=np.float32, e.g.: vec = TfidfVectorizer(..., dtype=np.float32)

    As for sparse vs. dense, I have a similar problem. GradientBoostingClassifier, RandomForestClassifier, and DecisionTreeClassifier need dense data in older scikit-learn releases (recent ones accept sparse input for these estimators); for that reason I use SVC.
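
    For reference, here is a minimal sketch of fitting the tree directly on the sparse matrix, combined with the float32 suggestion. It assumes a scikit-learn release whose DecisionTreeClassifier.fit accepts SciPy sparse input (recent releases do; the Python 2.7 era install in the question may not). The documents and labels are made-up placeholders, not the asker's data.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in data; the real corpus and labels are not shown in the question.
docs = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday",
    "project status meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# dtype=np.float32 asks the vectorizer for 4-byte values instead of 8-byte ones.
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english",
                             dtype=np.float32)
X_train = vectorizer.fit_transform(docs)  # SciPy sparse matrix, never densified

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, labels)  # sparse matrix passed as-is, no .todense()/.toarray()

# New documents go through the same fitted vectorizer and also stay sparse.
X_new = vectorizer.transform(["buy cheap pills"])
pred = clf.predict(X_new)
print(sparse.issparse(X_train), pred)
```

    Whether a single unrestricted tree trains in reasonable time on 1.6M rows is a separate question; limiting max_depth or switching to a linear model such as the SVC mentioned above are the usual fallbacks.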