Tags: python, scikit-learn, random-forest, prediction, panel-data

Random Forest on Panel Data using Python


I am having trouble running a random forest regression on panel data.

The data currently looks like this:

[image of the data not available]

I want to conduct a random forest regression which predicts KwH for each ID over time based on the variables I have. I have split my data into training and test samples using the following code:

from sklearn.model_selection import train_test_split
X = df[['hour', 'day', 'month', 'dayofweek', 'apparentTemperature',
       'summary', 'household_size', 'work_from_home', 'num_rooms',
       'int_in_renew', 'int_in_gen', 'conc_abt_cc', 'feel_abt_lifestyle',
       'smrt_meter_help', 'avg_gender', 'avg_age', 'house_type', 'sum_insul',
       'total_lb', 'total_fridges', 'bigg_apps', 'small_apps',
       'look_at_meter']]
y = df[['KwH']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

I then want to train my model and evaluate it on the test sample, but I am unsure how to do this. I have tried this code:

from sklearn.ensemble import RandomForestRegressor
rfc = RandomForestRegressor(n_estimators=200)
rfc.fit(X_train, y_train)

However I get the following error message:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

I'm not sure whether the problem lies in the way my data is arranged or in the way I am fitting the random forest, so any help with this, and with then evaluating the model on the test sample, would be greatly appreciated.

Thanks in advance.


Solution

  • Switching y = df[['KwH']] to y = df['KwH'] (or y = df.KwH) should resolve this.

    scikit-learn expects y to be a 1-D array of shape (n_samples,). Selecting with double brackets [[...]] returns a one-column DataFrame of shape (n_samples, 1), whereas single brackets return a Series, which is 1-D.
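
    As a rough illustration of the fix, and of scoring the model on the test sample afterwards, here is a minimal sketch. The data frame is a synthetic stand-in (an assumption, since the real data is not shown) that borrows only three of the question's column names. The random_state values and the choice of MAE and R^2 as metrics are also illustrative, not something the question specifies.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's data (assumption: the real df
# has many more columns; only three are mimicked here).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "apparentTemperature": rng.normal(10, 5, n),
    "household_size": rng.integers(1, 6, n),
})
df["KwH"] = (0.05 * df["hour"]
             + 0.1 * df["household_size"]
             + rng.normal(0, 0.1, n))

X = df[["hour", "apparentTemperature", "household_size"]]
y = df["KwH"]  # single brackets: a 1-D Series of shape (n_samples,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rfc = RandomForestRegressor(n_estimators=200, random_state=0)
rfc.fit(X_train, y_train)  # no column-vector message with a 1-D y

# Evaluate on the held-out test sample.
y_pred = rfc.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.3f}  R^2: {r2:.3f}")
```

    If y has to stay a one-column DataFrame for other reasons, passing y_train.values.ravel() to fit, as the message itself suggests, should have the same effect.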