I have a pandas DataFrame with 86k rows, 5 feature columns, and 1 target column. I'm trying to train a DecisionTreeClassifier on 70% of the DataFrame, and the fit method raises a MemoryError. I've tried changing some of the parameters, but I don't know what's actually causing the error, so I don't know how to work around it. I'm on Windows 10 with 8 GB of RAM.
Code
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hold out 30% of the rows for testing
train, test = train_test_split(data, test_size=0.3)
X_train = train.iloc[:, 1:-1]  # first column is not a feature
y_train = train.iloc[:, -1]    # last column is the target
X_test = test.iloc[:, 1:-1]
y_test = test.iloc[:, -1]

DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)  # MemoryError is raised here
dt_predictions = DT.predict(X_test)
Error
File (...), line 97, in <module>
DT.fit(X_train, y_train)
File "(...)\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\tree\tree.py", line 790, in fit
X_idx_sorted=X_idx_sorted)
File "(...)\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\tree\tree.py", line 362, in fit
builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
File "sklearn\trewe\_tree.pyx", line 145, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn\tree\_tree.pyx", line 244, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn\tree\_tree.pyx", line 735, in sklearn.tree._tree.Tree._add_node
File "sklearn\tree\_tree.pyx", line 707, in sklearn.tree._tree.Tree._resize_c
File "sklearn\tree\_utils.pyx", line 39, in sklearn.tree._utils.safe_realloc
MemoryError: could not allocate 671612928 bytes
The same error happens when I try RandomForestClassifier, always in the line that does the fitting. How can I solve this?
I've been running into the same issue. Be sure you're dealing with a classification problem and not a regression problem. If your target column is continuous, DecisionTreeClassifier (and RandomForestClassifier) treats every distinct value as a separate class, so with tens of thousands of unique target values the tree has to grow enough nodes to separate them all, which is what exhausts memory during fit. In that case you want a regressor instead, e.g. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html rather than RandomForestClassifier.
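A quick way to check is to look at the target's dtype and distinct-value count before fitting. Here's a minimal sketch, assuming your DataFrame is named data with the target in the last column, as in your code; the 100-value threshold is an arbitrary illustration, not a scikit-learn rule:

y = data.iloc[:, -1]  # target column, as in the question

# A continuous target usually shows up as a float dtype with
# (almost) as many distinct values as there are rows.
print(y.dtype)     # e.g. float64 hints at a regression target
print(y.nunique()) # e.g. ~86k distinct values out of 86k rows

if y.nunique() > 100:  # arbitrary cutoff for this sketch
    # Treat it as regression: same fit/predict API,
    # but no per-class bookkeeping in the tree.
    from sklearn.tree import DecisionTreeRegressor
    model = DecisionTreeRegressor()
else:
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier()

model.fit(X_train, y_train)

If the target turns out to be genuinely categorical and you still run out of memory, you can also try capping tree growth with parameters like max_depth or min_samples_leaf, since the allocation that fails in your traceback is the tree's node array being resized as the tree grows.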