Problem Statement
I have a document of 1,600,000 lines and about 66k features, and I am using the bag-of-words approach to build a decision tree. The following code works fine for a 1,000-line document, but it throws a MemoryError for the actual 1,600,000-line document. My server has 64 GB of RAM.
Instead of using .todense() or .toarray(), is there any way to use the sparse matrix directly? Or is there an option to reduce the default dtype of float64?
Kindly help me on this.
Code:
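from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer

# corpus holds the text lines, corpus2 the corresponding class labels
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(corpus)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train.todense(), corpus2)  # the MemoryError is raised here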
Error:
Traceback (most recent call last):
  File "test123.py", line 103, in <module>
    clf = clf.fit(X_train.todense(),corpus2)
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 458, in todense
    return np.asmatrix(self.toarray())
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 550, in toarray
    return self.tocoo(copy=False).toarray()
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 219, in toarray
    B = np.zeros(self.shape, dtype=self.dtype)
MemoryError
In short, is there any way to use a classification tree on a large data set with 66k features?
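Pass dtype=np.float32 to the vectorizer, e.g. vec = TfidfVectorizer(..., dtype=np.float32). This halves the memory used by the matrix values compared with the default float64.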
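To make the dtype suggestion concrete, here is a minimal sketch on a tiny made-up corpus (the corpus and variable names are placeholders, not the asker's data). It also works out the size of a dense 1,600,000 x 66,000 array, which suggests that float32 alone would probably not make .todense() fit into 64 GB.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real 1,600,000-line document.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats play together",
]

# Same vectorizer twice: default float64 values versus float32 values.
X64 = TfidfVectorizer(sublinear_tf=True).fit_transform(corpus)
X32 = TfidfVectorizer(sublinear_tf=True, dtype=np.float32).fit_transform(corpus)

print(X64.dtype, X64.data.nbytes)  # bytes used by the stored non-zero values
print(X32.dtype, X32.data.nbytes)

# Back-of-the-envelope size of the dense array that .todense() would allocate.
n_rows, n_cols = 1600000, 66000
dense_gb_64 = n_rows * n_cols * 8 / 1e9  # float64: 8 bytes per cell
dense_gb_32 = n_rows * n_cols * 4 / 1e9  # float32: 4 bytes per cell
print(dense_gb_64, dense_gb_32)
```

With these figures the dense array comes to roughly 845 GB in float64 and 422 GB in float32, so the smaller dtype helps the sparse matrix but does not remove the need to avoid the dense conversion.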
As for sparse vs. dense, I have a similar problem: GradientBoostingClassifier, RandomForestClassifier and DecisionTreeClassifier need dense data, at least in the scikit-learn version I have, so I use SVC instead.
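For what it's worth, more recent scikit-learn releases accept scipy sparse matrices directly in DecisionTreeClassifier.fit (I believe since 0.16), so on such a version the .todense() call may simply be dropped. The sketch below assumes a recent scikit-learn and uses a made-up corpus and labels. It also fits LinearSVC on the same sparse matrix as one linear alternative, which is my substitution for the SVC mentioned above rather than anything from the question.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; the real corpus and labels come from the question.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply on monday",
    "the bank raised interest rates",
]
labels = ["pets", "pets", "finance", "finance"]

vectorizer = TfidfVectorizer(sublinear_tf=True)
X_train = vectorizer.fit_transform(corpus)  # scipy CSR matrix, never densified

# Both estimators are fitted on the sparse matrix as-is.
tree_clf = DecisionTreeClassifier(random_state=0).fit(X_train, labels)
svm_clf = LinearSVC().fit(X_train, labels)

# New text is transformed with the same vectorizer and stays sparse too.
X_new = vectorizer.transform(["the cat chased the dog"])
tree_pred = tree_clf.predict(X_new)
svm_pred = svm_clf.predict(X_new)
print(tree_pred, svm_pred)
```

Whether a single tree over 66k sparse features trains in acceptable time on 1,600,000 rows is a separate question. Limiting max_depth or max_features on the tree is one thing to try if it does not.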