I want to run a random forest classifier on my data set, which is pretty big: 1 million rows and 300 columns. Of course, I'd prefer not to run the model for something like 3 days non-stop. So I was wondering whether there are some good practices for finding the optimal trade-off between running time and prediction quality.
Here are some examples of what I was thinking:
Can I use a random subsample of x rows to tune the parameters and then use those parameters for the model with all the data? (If yes, how do I find the best value for x?)
Is there a way to know at what point it becomes useless to keep adding more data because the predictions will stop improving? (i.e., what is the minimum number of rows that gives me the best results for the running time?)
How can I estimate the running time of the model? With 4,000 rows the model takes 4 minutes; with 8,000 it takes 10 minutes. Is the running time exponential, or is it more or less linear, so that I could expect around 1280 minutes with 1 million rows?
Tuning the parameters on a random subsample and then reusing them on the full data rarely works well, as a small subsample may not be representative of the full data.
Regarding the amount of data vs. model quality: try using learning curves from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html
from sklearn.model_selection import learning_curve
import numpy as np

# estimator is your RandomForestClassifier; X and y are your data
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator,
    X,
    y,
    cv=cv,                    # e.g. cv=5 for 5-fold cross-validation
    n_jobs=n_jobs,            # e.g. n_jobs=-1 to use all cores
    train_sizes=train_sizes,  # e.g. np.linspace(0.1, 1.0, 5)
    return_times=True,        # also return the fit time at each training size
)
This way you'll be able to plot the amount of data vs. model performance. And since return_times=True also gives you the fit times, you can plot training time vs. sample size and extrapolate the expected running time on the full data. Here are some examples of plotting: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_ridge_regression.html#sphx-glr-auto-examples-miscellaneous-plot-kernel-ridge-regression-py
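A minimal plotting sketch, assuming matplotlib and the train_sizes, test_scores, and fit_times arrays returned by the call above:

import matplotlib.pyplot as plt

# average the cross-validation folds at each training size
mean_test_scores = test_scores.mean(axis=1)
mean_fit_times = fit_times.mean(axis=1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# performance vs. amount of data: the curve flattens once extra rows stop helping
ax1.plot(train_sizes, mean_test_scores, "o-")
ax1.set_xlabel("Training set size")
ax1.set_ylabel("Mean CV score")

# training time vs. amount of data: useful for extrapolating the full-data run time
ax2.plot(train_sizes, mean_fit_times, "o-")
ax2.set_xlabel("Training set size")
ax2.set_ylabel("Mean fit time (s)")

plt.tight_layout()
plt.show()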
Some additional practical suggestions:
Set n_jobs=-1 to run the model in parallel on all cores;
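For example (a minimal sketch; the n_estimators and max_depth values are just placeholders, not tuned recommendations):

from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 builds the trees in parallel on all available cores
clf = RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=-1, random_state=42)
clf.fit(X, y)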