I am trying to map 13-dimensional input data to 3-dimensional output data using RandomForestRegressor and GradientBoostingRegressor from scikit-learn. This works fine for the RandomForestRegressor, but the GradientBoostingRegressor raises ValueError: y should be a 1d array, got an array of shape (16127, 3) instead.
I don't really understand why I get this error with GradientBoostingRegressor but not with RandomForestRegressor. As far as I understand, both use decision trees as weak learners and combine them to get a good result. Of course I know that I could transform the 3-dimensional output labels into a 1-dimensional array, but this does not make sense as I want to map to a 3-dimensional output vector. Any idea how I can do this using the GradientBoostingRegressor?
Here is my code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
# Read data from csv files
Input_data_features = pd.read_csv("C:/Users/wi9632/Desktop/TestData_InputFeatures.csv", sep=';')
Input_data_labels = pd.read_csv("C:/Users/wi9632/Desktop/TestData_OutputLabels.csv", sep=';')
Input_data_features = Input_data_features.values
Input_data_labels = Input_data_labels.values
# standardize input features X and output labels Y
scaler_standardized_X = StandardScaler()
Input_data_features = scaler_standardized_X.fit_transform(Input_data_features)
scaler_standardized_Y = StandardScaler()
Input_data_labels = scaler_standardized_Y.fit_transform(Input_data_labels)
# Split dataset into train, validation, and test
index_X_Train_End = int(0.7 * len(Input_data_features))
index_X_Validation_End = int(0.9 * len(Input_data_features))
X_train = Input_data_features[0: index_X_Train_End]
X_valid = Input_data_features[index_X_Train_End: index_X_Validation_End]
X_test = Input_data_features[index_X_Validation_End:]
Y_train = Input_data_labels[0: index_X_Train_End]
Y_valid = Input_data_labels[index_X_Train_End: index_X_Validation_End]
Y_test = Input_data_labels[index_X_Validation_End:]
#Define a random forest model and train it
model_randomForest = RandomForestRegressor( )
model_randomForest.fit(X_train, Y_train)
#Predict the test data with Random Forest
Y_pred_randomForest = model_randomForest.predict(X_test)
print(f"Random Forest Prediction: {Y_pred_randomForest}")
#Define a gradient boosting model and train it (-->Here I get the ValueError)
model_gradientBoosting = GradientBoostingRegressor( )
model_gradientBoosting.fit(X_train, Y_train)
#Predict the test data with Gradient Boosting
Y_pred_gradientBoosting = model_gradientBoosting.predict(X_test)
print(f"Gradient Boosting Prediction: {Y_pred_gradientBoosting}")
Here is the test data: https://filetransfer.io/data-package/ABCrGPzt#link
Reminder: I have not been able to solve this problem yet. Does anybody have an idea how to tackle it?
RandomForestRegressor supports multi-output regression natively (see docs). GradientBoostingRegressor does not: its fit method expects y to be a 1d array.
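To make the difference concrete, here is a minimal sketch on synthetic data. The array sizes and the small `n_estimators` values are assumptions chosen only to keep it fast; they are not taken from the question's CSV files.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic stand-ins for the question's data (assumed shapes: 13 inputs, 3 outputs)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
Y = rng.normal(size=(200, 3))

# RandomForestRegressor accepts a 2D target and predicts all 3 columns at once
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, Y)
rf_pred_shape = rf.predict(X).shape

# GradientBoostingRegressor validates y as a 1D array and rejects the 2D target
try:
    GradientBoostingRegressor(n_estimators=10).fit(X, Y)
    gb_error = None
except ValueError as exc:
    gb_error = str(exc)

print(rf_pred_shape)
print(gb_error)
```

The second print should show the same kind of message as in the question, with the shape of the synthetic target instead of (16127, 3).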
You can use MultiOutputRegressor + GradientBoostingRegressor for the problem. See this answer.
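from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
# MultiOutputRegressor fits one independent GradientBoostingRegressor per output column
params = {'n_estimators': 5000, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 2}
estimator = MultiOutputRegressor(GradientBoostingRegressor(**params))
# X_train and Y_train as defined in the question's code
estimator.fit(X_train, Y_train)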
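For completeness, here is a self-contained sketch of the whole flow, including mapping predictions back to the original label scale. The synthetic data, the fixed split index, and the small `n_estimators` are assumptions made so the example runs quickly. Fitting the label scaler on the training rows only is a choice of this sketch, not something the question's code does.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): 13 input features, 3 outputs that depend linearly on them
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))
W = rng.normal(size=(13, 3))
Y = X @ W + 0.1 * rng.normal(size=(300, 3))

# Simple ordered split: first 210 rows for training, remaining 90 for testing
split = 210
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]

# Standardize the labels using statistics from the training rows only
y_scaler = StandardScaler().fit(Y_train)

# One GradientBoostingRegressor is fitted per output column
model = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
)
model.fit(X_train, y_scaler.transform(Y_train))

# Predict in the standardized space, then map back to the original label scale
Y_pred = y_scaler.inverse_transform(model.predict(X_test))

print(Y_pred.shape)            # one row per test sample, one column per output
print(len(model.estimators_))  # number of fitted per-output regressors
```

Because each output gets its own independent model, correlations between the three outputs are not exploited. If that matters, a natively multi-output model such as RandomForestRegressor may still be the better fit.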