machine-learning scikit-learn regression random-forest

Prediction using sklearn's RandomForestRegressor


Here's what my data looks like...

date,locale,category,site,alexa_rank,sessions,user_logins
20170110,US,1,google,1,500,5000
20170110,EU,1,google,2,400,2000
20170111,US,2,facebook,2,400,2000

... and so on. This is just a toy dataset I came up with, but it resembles the original data.

I'm trying to build a model to predict how many user logins and sessions a particular site would have, using sklearn's RandomForestRegressor.

I do the usual preprocessing, encoding categories to labels. I've trained my model on the first eight months of the year, and now I'd like to predict logins and sessions for the ninth month. I've created two models: one trained on logins and another trained on sessions.

My test dataset is of the same form:

date,locale,category,site,alexa_rank,sessions,user_logins
20170910,US,1,google,1,500,5000
20170910,EU,1,google,2,400,2000
20170911,US,2,facebook,2,400,2000

Ideally I'd like to pass in the test dataset without the columns I need predicted, but RandomForestRegressor complains about the dimensions being different between the training and test set.

When I pass the test dataset in its current form, the model predicts the exact values in the sessions and user_logins columns in most cases and values with tiny variations otherwise.

I zeroed out the sessions and user_logins columns in the test data and passed it to the model but the model predicted nearly all zeroes.

  • Is my workflow correct? Am I using RandomForestRegressor correctly?
  • How am I getting so close to the actual values when my test dataset does contain actual values? Are the actual values in the test data being used in the prediction?
  • If the model works correctly, shouldn't I be getting the same values predicted if I zero out the columns I'm looking to predict (sessions and user_logins)?

Solution

  • Your workflow is not correct: you shouldn't pass the columns you want to predict as part of the test data.

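    If X is the set of columns holding the information you have and y is the set of columns you want to predict, pass (X_train, y_train) during training (the fit method) and only X_test during testing (the predict method). You then obtain y_pred, which you can compare with y_test if you have it. Since your model reproduces the sessions and user_logins values it is given, and returns zeros when they are zeroed out, those columns were most likely among its input features, so it learned to copy them rather than predict them.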

    In your example, if you want to predict user_logins:

    from numpy import array
    from sklearn.ensemble import RandomForestRegressor

    # Columns: date, locale, category, site, alexa_rank, sessions.
    # locale and site are label-encoded (US=0, EU=1; google=0, facebook=1),
    # because the estimator needs numeric input.
    X_train = array([[20170110, 0, 1, 0, 1, 500],
                     [20170110, 1, 1, 0, 2, 400],
                     [20170111, 0, 2, 1, 2, 400]])
    y_train = array([5000, 2000, 2000])  # user_logins

    # The test rows have the same six feature columns and no user_logins.
    X_test = array([[20170112, 1, 2, 0, 1, 500],
                    [20170113, 0, 1, 1, 2, 400],
                    [20170114, 0, 2, 0, 1, 500]])

    estimator = RandomForestRegressor().fit(X_train, y_train)
    y_pred = estimator.predict(X_test)
    

    Take a look at the documentation for more examples, or at the tutorials.
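
To tie this back to the CSV layout in the question, here is a minimal sketch of the same split done with pandas. It rests on several assumptions: the data is synthetic, the helper names (make_toy_frame, split_features_targets) are hypothetical, and neither sessions nor user_logins is available at prediction time, so both are dropped from the features. It also fits a single multi-output forest by passing a two-column y, which RandomForestRegressor supports; keeping two separate models, as in the question, works just as well.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

TARGETS = ["sessions", "user_logins"]   # columns to predict
CATEGORICAL = ["locale", "site"]        # string columns that need encoding


def make_toy_frame(dates, seed):
    """Build synthetic rows shaped like the question's CSV (made-up values)."""
    rng = np.random.default_rng(seed)
    rows = []
    for date in dates:
        for locale in ("US", "EU"):
            for rank, site in enumerate(("google", "facebook"), start=1):
                sessions = int(600 - 100 * rank + rng.integers(0, 50))
                rows.append({
                    "date": date, "locale": locale, "category": rank,
                    "site": site, "alexa_rank": rank,
                    "sessions": sessions,
                    "user_logins": sessions * 5,
                })
    return pd.DataFrame(rows)


def split_features_targets(df, categories):
    """Return (X, y): X excludes the target columns, y contains only them."""
    X = df.drop(columns=TARGETS).copy()
    for col in CATEGORICAL:
        # Fixed category lists keep the integer codes identical in train and test.
        X[col] = pd.Categorical(X[col], categories=categories[col]).codes
    return X, df[TARGETS]


train = make_toy_frame(range(20170101, 20170121), seed=0)
test = make_toy_frame(range(20170901, 20170906), seed=1)

# Learn the category vocabulary from the training data only.
categories = {col: sorted(train[col].unique()) for col in CATEGORICAL}

X_train, y_train = split_features_targets(train, categories)
X_test, y_test = split_features_targets(test, categories)

# A 2-D y trains one multi-output forest instead of two separate models.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)  # one row per test row, one column per target
```

Because X_train and X_test are built by the same function, they always have the same columns, which avoids the dimension mismatch described in the question. y_test is kept aside only for scoring y_pred and is never shown to the model.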