Tags: python, machine-learning, scikit-learn, random-forest, cross-validation

How to predict with the test dataset while using cross validation?


I would like to use cross-validation for my prediction model. I want to keep 20% of my data as a test set, and use the rest of my data to fit my model with cross-validation.

I would like it to look like the following:

[diagram: data split into an 80% training set used for cross-validation and a 20% held-out test set]
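
Concretely, the split I have in mind is something like this (just a sketch; X and y stand for my full feature matrix and target):

from sklearn.model_selection import train_test_split

# Keep 20% of the data as the final test set; the remaining 80% will be used
# to fit the model with cross-validation (random_state is arbitrary).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)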

As machine learning models, I would like to use Random Forest and LightGBM.

from sklearn.ensemble import RandomForestRegressor

random_forest = RandomForestRegressor(n_estimators=1400, max_depth=80, max_features='sqrt',
                                      min_samples_leaf=1, min_samples_split=5,
                                      random_state=1, verbose=1, n_jobs=-1)

from sklearn.model_selection import cross_val_score
scores = cross_val_score(random_forest, X_train, y_train, cv=5, scoring='r2')

This gives the cross-validation scores, but I also want to predict the y values of the X_test data. Could you please help me with this? After that, I will create a model for LightGBM as well.


Solution

  • Generally speaking, cross-validation (CV) is used for one of the following two reasons:

    • Model tuning (i.e. hyperparameter search), in order to find the hyperparameters that maximize model performance; in scikit-learn, this is usually accomplished with the GridSearchCV class (see the sketch after this list)
    • Performance assessment of a single model, where you are not interested in selecting the hyperparameters of your model; this is normally achieved with cross_val_score
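
    A minimal sketch of the first case (hyperparameter search) could look like the following; the parameter grid is purely illustrative, not a recommendation:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # Illustrative grid only; with refit=True (the default), GridSearchCV
    # refits the best parameter combination on all of X_train.
    param_grid = {'n_estimators': [400, 800, 1400], 'max_depth': [40, 80]}
    search = GridSearchCV(RandomForestRegressor(random_state=1), param_grid,
                          cv=5, scoring='r2', n_jobs=-1)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)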

    From your setting, it is clear that you are in the second case above: for whatever reason, you seem to have concluded that the hyperparameters to be used are the ones you show in the definition of your model, and, before proceeding to fit it, you want an indication of how well it performs. You have chosen to do so using cross_val_score, and your shown code is indeed fine up to this point.

    But you are not finished: cross_val_score does only that, i.e. it returns an array of scores; it does not return a fitted model. So, in order to actually fit your model and get predictions on your test set (assuming of course that you are satisfied with the scores returned by cross_val_score), you need to do so explicitly:

    random_forest.fit(X_train, y_train)
    pred = random_forest.predict(X_test) 
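
    If you then want to compare the test-set performance with the cross-validation estimate, you can score these held-out predictions with the same metric (a sketch, assuming you have kept a y_test from your initial split):

    from sklearn.metrics import r2_score

    # Same metric ('r2') as used in cross_val_score above
    print(r2_score(y_test, pred))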
    

    And the procedure should be similar for LightGBM as well.
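
    For example, a minimal sketch with LightGBM's scikit-learn wrapper could be (hyperparameters here are placeholders, not tuned values):

    from lightgbm import LGBMRegressor
    from sklearn.model_selection import cross_val_score

    lgbm = LGBMRegressor(n_estimators=1000, random_state=1, n_jobs=-1)

    # Same two steps: estimate performance with CV, then fit and predict
    scores_lgbm = cross_val_score(lgbm, X_train, y_train, cv=5, scoring='r2')
    lgbm.fit(X_train, y_train)
    pred_lgbm = lgbm.predict(X_test)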