I want to run a random forest classifier on my data set, which is pretty big: 1 million rows and 300 columns. Of course, I'd prefer not to run the model for something like 3 days non-stop. So I was wondering whether there are some good practices for finding the optimal trade-off between running time and prediction quality.
Here are some examples of what I was thinking:
Can I use a random subsample of x rows to tune the parameters and then use those parameters for the model with all the data? (If yes, how do I find the best value for x?)
Is there a way to know at what point it becomes useless to keep adding more data because the predictions will stop improving? (i.e., what is the minimum number of rows that gives me the best results for the running time?)
How can I estimate the running time of the model? With 4,000 rows the model takes 4 minutes; with 8,000 it takes 10 minutes. Is the running time exponential, or is it more or less linear, so that I could expect around 1280 minutes with 1 million rows?
Tuning the parameters on a random subsample and then reusing them on the full data rarely works well, as a small subsample may not be representative of the full data.
Regarding the amount of data vs. model quality: try using learning curves from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html
from sklearn.model_selection import learning_curve
import numpy as np

# estimator is your RandomForestClassifier; X and y are your data
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator,
    X,
    y,
    cv=cv,                    # e.g. cv=5 for 5-fold cross-validation
    n_jobs=n_jobs,            # e.g. n_jobs=-1 to use all cores
    train_sizes=train_sizes,  # e.g. np.linspace(0.1, 1.0, 5)
    return_times=True,        # also return the fit time at each training size
)
This way you'll be able to plot the amount of data vs. model performance. And since return_times=True also gives you the fit times, you can plot training time vs. sample size and extrapolate the expected running time on the full data. Here are some examples of plotting: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_ridge_regression.html#sphx-glr-auto-examples-miscellaneous-plot-kernel-ridge-regression-py
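A minimal plotting sketch, assuming matplotlib and the train_sizes, test_scores, and fit_times arrays returned by the call above:

import matplotlib.pyplot as plt

# average the cross-validation folds at each training size
mean_test_scores = test_scores.mean(axis=1)
mean_fit_times = fit_times.mean(axis=1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# performance vs. amount of data: the curve flattens once extra rows stop helping
ax1.plot(train_sizes, mean_test_scores, "o-")
ax1.set_xlabel("Training set size")
ax1.set_ylabel("Mean CV score")

# training time vs. amount of data: useful for extrapolating the full-data run time
ax2.plot(train_sizes, mean_fit_times, "o-")
ax2.set_xlabel("Training set size")
ax2.set_ylabel("Mean fit time (s)")

plt.tight_layout()
plt.show()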
Some additional practical suggestions:
Set n_jobs=-1 to run the model in parallel on all cores;
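For example (a minimal sketch; the n_estimators and max_depth values are just placeholders, not tuned recommendations):

from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 builds the trees in parallel on all available cores
clf = RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=-1, random_state=42)
clf.fit(X, y)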