How to predict correctly in sklearn RandomForestRegressor?

I'm working on a big data project for school. My dataset looks like this:

I'm trying to predict the next values of "LandAverageTemperature".

First, I imported the CSV into pandas as a DataFrame named "df1".

After getting errors on my first attempts with sklearn, I converted the "dt" column from string to datetime64, then added a column named "year" that holds only the year of each date (this is probably wrong):

df1["year"] = pd.DatetimeIndex(df1['dt']).year

After all of that, I prepared my data for regression and called RandomForestRegressor:

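landAvg = df1[["LandAverageTemperature"]]
year = df1[["year"]]

from sklearn.ensemble import RandomForestRegressor

# The model setup is missing from the post; the two lines below are an assumed
# reconstruction (hyperparameters borrowed from the answer's example).
rf_reg = RandomForestRegressor(n_estimators=10, random_state=0)
rf_reg.fit(year, landAvg)

print("Random forest:", rf_reg.predict(landAvg))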

I ran the code and got this result:

Random forest: [9.26558115 9.26558115 9.26558115 ... 9.26558115 9.26558115 9.26558115]

I'm not getting any errors, but I don't think the results are correct: as you can see, they are all the same. Besides, I don't know how to get predictions for the next 10 years. Can you help me improve my code and get the right results? Thanks in advance for your help.


  • It's not enough to use only the year to predict temperature. You need to use the month too. Here is a working example for starters:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # The CSV path was left blank in the original post; fill in your own file.
    df = pd.read_csv('', usecols=['dt','LandAverageTemperature'], parse_dates=['dt'])
    df = df.dropna()
    # Derive numeric features from the date.
    df["year"] = df['dt'].dt.year
    df["month"] = df['dt'].dt.month
    X = df[["month", "year"]]
    y = df["LandAverageTemperature"]
    rf_reg = RandomForestRegressor(n_estimators=10, random_state=0)
    rf_reg.fit(X, y)
    # Predictions on the training rows themselves (in-sample).
    y_pred = rf_reg.predict(X)
    df_result = pd.DataFrame({'year': X['year'], 'month': X['month'], 'true': y, 'pred': y_pred})
    print('True values and predictions')
    print(df_result)
    print('Feature importances', list(zip(X.columns, rf_reg.feature_importances_)))

    And here is the output:

    True values and predictions
          year  month    true     pred
    0     1750      1   3.034   2.2944
    1     1750      2   3.083   2.4222
    2     1750      3   5.626   5.6434
    3     1750      4   8.490   8.3419
    4     1750      5  11.573  11.7569
    ...    ...    ...     ...      ...
    3187  2015      8  14.755  14.8004
    3188  2015      9  12.999  13.0392
    3189  2015     10  10.801  10.7068
    3190  2015     11   7.433   7.1173
    3191  2015     12   5.518   5.1634
    [3180 rows x 4 columns]
    Feature importances [('month', 0.9543059863177156), ('year', 0.045694013682284394)]
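
For the "next 10 years" part of the question, one possible approach is to build a table of future (month, year) rows and pass it to predict. The sketch below is illustrative only: it uses synthetic data in place of the real CSV (an assumption, since the file path is not given), and the names future_dates, X_future and df_future are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real CSV (assumption): monthly temperatures
# with a seasonal cycle, a small warming trend, and some noise.
dates = pd.date_range("1950-01-01", "2015-12-01", freq="MS")
month = dates.month.to_numpy()
year = dates.year.to_numpy()
rng = np.random.default_rng(0)
temps = (
    8.5
    - 6.0 * np.cos(2 * np.pi * (month - 1) / 12)
    + 0.01 * (year - 1950)
    + rng.normal(0, 0.3, len(dates))
)
df = pd.DataFrame({"dt": dates, "LandAverageTemperature": temps})
df["year"] = df["dt"].dt.year
df["month"] = df["dt"].dt.month

# Same features and model settings as the example above.
X = df[["month", "year"]]
y = df["LandAverageTemperature"]
rf_reg = RandomForestRegressor(n_estimators=10, random_state=0)
rf_reg.fit(X, y)

# One row per month for the 10 years following the last observation.
last_date = df["dt"].max()
future_dates = pd.date_range(
    last_date + pd.DateOffset(months=1), periods=120, freq="MS"
)
# Column order must match the order used when fitting.
X_future = pd.DataFrame({"month": future_dates.month, "year": future_dates.year})

df_future = X_future.copy()
df_future["pred"] = rf_reg.predict(X_future)
print(df_future.head(12))
```

Note that every future year gets the same 12 monthly values. Tree-based models cannot extrapolate beyond the range seen in training, so any year after the last training year falls into the same leaves as the most recent years. Capturing a real long-term trend would need a different model or differently engineered features.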