Tags: python, scikit-learn, boosting

How to use a GradientBoostingRegressor in scikit-learn with 3 output dimensions


I am trying to map 13-dimensional input data to 3-dimensional output data using scikit-learn's RandomForestRegressor and GradientBoostingRegressor. While this works fine for the RandomForestRegressor, the GradientBoostingRegressor raises ValueError: y should be a 1d array, got an array of shape (16127, 3) instead.

I don't really understand why I get this error with the GradientBoostingRegressor but not with the RandomForestRegressor. As far as I understand, both of them use decision trees as weak learners and combine them to get a good result. Of course, I know that I could flatten the 3-dimensional output labels into a 1-dimensional array, but that does not make sense, as I want to map to a 3-dimensional output vector. Any idea how I can do this using the GradientBoostingRegressor?

Here is my code:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Read data from CSV files
Input_data_features = pd.read_csv("C:/Users/wi9632/Desktop/TestData_InputFeatures.csv", sep=';')
Input_data_labels = pd.read_csv("C:/Users/wi9632/Desktop/TestData_OutputLabels.csv", sep=';')
Input_data_features = Input_data_features.values
Input_data_labels = Input_data_labels.values


# standardize input features X and output labels Y
scaler_standardized_X = StandardScaler()
Input_data_features = scaler_standardized_X.fit_transform(Input_data_features)

scaler_standardized_Y = StandardScaler()
Input_data_labels = scaler_standardized_Y.fit_transform(Input_data_labels)


# Split dataset into train, validation, and test sets
index_X_Train_End = int(0.7 * len(Input_data_features))
index_X_Validation_End = int(0.9 * len(Input_data_features))

X_train = Input_data_features[0: index_X_Train_End]
X_valid = Input_data_features[index_X_Train_End: index_X_Validation_End]
X_test = Input_data_features[index_X_Validation_End:]

Y_train = Input_data_labels[0: index_X_Train_End]
Y_valid = Input_data_labels[index_X_Train_End: index_X_Validation_End]
Y_test = Input_data_labels[index_X_Validation_End:]


# Define a random forest model and train it
model_randomForest = RandomForestRegressor()
model_randomForest.fit(X_train, Y_train)

# Predict the test data with Random Forest
Y_pred_randomForest = model_randomForest.predict(X_test)
print(f"Random Forest Prediction: {Y_pred_randomForest}")


# Define a gradient boosting model and train it (--> here I get the ValueError)
model_gradientBoosting = GradientBoostingRegressor()
model_gradientBoosting.fit(X_train, Y_train)

# Predict the test data with Gradient Boosting
Y_pred_gradientBoosting = model_gradientBoosting.predict(X_test)
print(f"Gradient Boosting Prediction: {Y_pred_gradientBoosting}")

Here is the test data: https://filetransfer.io/data-package/ABCrGPzt#link

Reminder: As I have not been able to solve this problem yet, I would like to remind you of this question. Does anybody have an idea how to tackle it?


Solution

  • RandomForestRegressor natively supports multi-output regression (see the docs); GradientBoostingRegressor does not.

    You can wrap GradientBoostingRegressor in a MultiOutputRegressor, which fits one separate regressor per target dimension. See this answer.

    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.multioutput import MultiOutputRegressor

    params = {'n_estimators': 5000, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 2}

    # Fit one GradientBoostingRegressor per output dimension
    estimator = MultiOutputRegressor(GradientBoostingRegressor(**params))
    estimator.fit(X_train, Y_train)
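
    For completeness, here is a minimal usage sketch (reusing the variable names from the question; the prediction and rescaling steps are assumptions about how you would evaluate the model). predict returns an array of shape (n_samples, 3), and since the labels were standardized, StandardScaler.inverse_transform maps the predictions back to the original scale:

    # Predict the test data; the result has shape (n_samples, 3)
    Y_pred_gradientBoosting = estimator.predict(X_test)

    # The labels were standardized, so map predictions back to the original scale
    Y_pred_original = scaler_standardized_Y.inverse_transform(Y_pred_gradientBoosting)
    print(f"Gradient Boosting Prediction: {Y_pred_original}")

    Note that MultiOutputRegressor fits the three outputs independently; if you expect them to be correlated, sklearn.multioutput.RegressorChain is an alternative that feeds each fitted output into the next regressor as an extra feature.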