I'm using a very simple CSV file that I downloaded from the Internet, with only two columns. The first column is "MonthsExperience", with values like "3, 3, 4, 4, 5, 6...", and the second column, "Income", has values like "424, 387, 555, 59, 533...".
I'm trying to compute cross_val_score for a RandomForestRegressor on this simple single-feature regression problem, as a learning exercise.
Here's the code:
import numpy as np
import pandas as pd
data = pd.read_csv("Blogging_Income.csv")
X = data["MonthsExperience"]
y = data["Income"]
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
from sklearn.model_selection import cross_val_score
cv_r2 = cross_val_score(rfr, X, y, cv = 5, scoring = None)
print(cv_r2)
I get a long warning from sklearn indicating that all the scores were set to NaN because the model failed to fit. The top part of the output looks like this:
[nan nan nan nan nan]
C:\Users\----\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:615: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "C:\Users\----\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\Users\----\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 304, in fit
X, y = self._validate_data(X, y, multi_output=True,
File "C:\Users\----\anaconda3\lib\site-packages\sklearn\base.py", line 433, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "C:\Users\----\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\----\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 871, in check_X_y
X = check_array(X, accept_sparse=accept_sparse,
File "C:\Users\----\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\----\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 694, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[ 6. 6. 7. 8. 8. 9. 9. 10. 11. 11. 12. 12. 12. 13. 13. 14. 14. 15.
15. 16. 16. 17. 18. 18.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
It appears that the array has the wrong shape, but I don't understand why. I also don't understand how to use array.reshape to make this work.
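RandomForestRegressor, like other scikit-learn estimators, expects X to be a 2D array of shape (n_samples, n_features). Even if you have just one feature, X has to be N x 1 rather than a vector of length N. Selecting a single column with data["MonthsExperience"] returns a 1D pandas Series, which is why every fit fails.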
You can reshape your data using NumPy:
X = np.array(X).reshape(-1, 1)
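
For reference, here is a minimal end-to-end sketch of the fix. Since I don't have your Blogging_Income.csv, the DataFrame below is a synthetic stand-in with made-up values (the column names match yours; the numbers, n_estimators=50, and random_state=0 are my own assumptions for illustration). It also shows an alternative to reshape: selecting the column with double brackets, which returns a 2D DataFrame directly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for Blogging_Income.csv (values are made up for illustration).
rng = np.random.default_rng(0)
months = np.repeat(np.arange(3, 19), 2)
data = pd.DataFrame({
    "MonthsExperience": months,
    "Income": 30 * months + rng.normal(0, 50, size=months.size),
})

X_1d = data["MonthsExperience"]              # Series: 1D, this is what triggers the error
X_2d = data[["MonthsExperience"]]            # DataFrame: 2D, one column
X_reshaped = X_1d.to_numpy().reshape(-1, 1)  # NumPy array: 2D, one column
y = data["Income"]

print(X_1d.shape, X_2d.shape, X_reshaped.shape)

# Either 2D form works; here the double-bracket DataFrame is passed in.
rfr = RandomForestRegressor(n_estimators=50, random_state=0)
cv_r2 = cross_val_score(rfr, X_2d, y, cv=5)
print(cv_r2)
```

With scoring left at its default, cross_val_score falls back to the estimator's own score method, which for regressors is R^2, so the five values printed are per-fold R^2 scores. The target y can stay 1D; only X needs the extra dimension.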