python, scikit-learn, neural-network, forecast, mlp

MLP with scikit-learn: Artificial Neural Network application for forecasting


I have traffic data and I want to predict the number of vehicles for the next hour by showing the model these inputs: this hour's number of vehicles and this hour's average speed value. Here is my code:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('/content/final - Sayfa5.csv', delimiter=',')
dataset = dataset[['MINIMUM_SPEED', 'MAXIMUM_SPEED', 'AVERAGE_SPEED', 'NUMBER_OF_VEHICLES', '1_LAG_NO_VEHICLES']]
X = np.array(dataset.iloc[:, 1:4])    # features: MAXIMUM_SPEED, AVERAGE_SPEED, NUMBER_OF_VEHICLES
L = len(dataset)
Y = np.array([dataset.iloc[:, 4]])    # target: 1_LAG_NO_VEHICLES
Y = Y[:, 0:L]
Y = np.transpose(Y)                   # shape (n_samples, 1)

#scaling with MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X)
X = scaler.transform(X)
 
scaler.fit(Y)
Y = scaler.transform(Y)
print(X,Y)

X_train , X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.3)
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error 
mlp = MLPRegressor(activation='logistic')
mlp.fit(X_train,Y_train)
predictions = mlp.predict(X_test)
predictions1=mlp.predict(X_train)
print("mse_test :" ,mean_squared_error(Y_test,predictions), "mse_train :",mean_squared_error(Y_train,predictions1))


I got good MSE values, such as mse_test: 0.005467816018933008 and mse_train: 0.005072774796622158.

But I am confused about two points:

  1. Should I scale the y values? I have read many blogs saying that one should not scale the ys, only X_train and X_test. But without scaling y I got very bad MSE scores, such as 49, 50, 100 or even more.

  2. How can I get predictions for the future in the original units, not scaled values? For example, I wrote:

    Xnew = [[80, 40, 47],
            [80, 30, 81],
            [80, 33, 115]]
    Xnew = scaler.transform(Xnew)
    print("prediction for that input is", mlp.predict(Xnew))

But I got scaled values, such as: prediction for that input is [0.08533431 0.1402755 0.19497315]

It should have been something like [81, 115, 102].


Solution

  • Congrats on using sklearn's MLPRegressor; an introduction to neural networks is always a good thing.

    Scaling your input data is critical for neural networks. Consider reviewing Chapter 11 of Ethem Alpaydin's Introduction to Machine Learning; the Efficient BackProp paper also covers this in great detail. Put plainly, scaling the inputs is what allows the model to learn a mapping to the output effectively.

    In plain English, scaling in this case means converting your data into values between 0 and 1 (inclusive). A good Stats StackExchange post describes the differences between scaling methods. With MinMax scaling you keep the same distribution of your data, which also means you stay sensitive to outliers. More robust methods (described in that post) exist in sklearn, such as RobustScaler; see the sketch right below.
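
    For instance, here is a minimal sketch (the numbers are invented, including one deliberate outlier) of how MinMaxScaler and RobustScaler treat the same column:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, RobustScaler

    # One feature column with a deliberate outlier (1000)
    x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

    # MinMaxScaler maps everything into [0, 1]; the outlier squeezes the
    # "normal" values down to roughly 0, 0.001, 0.002, 0.003
    print(MinMaxScaler().fit_transform(x).ravel())

    # RobustScaler centres on the median and scales by the IQR, so the bulk
    # of the data keeps a sensible spread: -1, -0.5, 0, 0.5 (outlier: 498.5)
    print(RobustScaler().fit_transform(x).ravel())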

    So take for example a very basic dataset like this:

    | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 | Target |
    |:---------:|:---------:|:---------:|:---------:|:---------:|:------:|
    |     1     |     17    |     22    |     3     |     3     |   53   |
    |     2     |     18    |     24    |     5     |     4     |   54   |
    |     1     |     11    |     22    |     2     |     5     |   96   |
    |     5     |     20    |     22    |     7     |     5     |   59   |
    |     3     |     10    |     26    |     4     |     5     |   66   |
    |     5     |     14    |     30    |     1     |     4     |   63   |
    |     2     |     17    |     30    |     9     |     5     |   93   |
    |     4     |     5     |     27    |     1     |     5     |   91   |
    |     3     |     20    |     25    |     7     |     4     |   70   |
    |     4     |     19    |     23    |     10    |     4     |   81   |
    |     3     |     13    |     8     |     19    |     5     |   14   |
    |     9     |     18    |     3     |     67    |     5     |   35   |
    |     8     |     12    |     3     |     34    |     7     |   25   |
    |     5     |     15    |     6     |     12    |     6     |   33   |
    |     2     |     13    |     2     |     4     |     8     |   21   |
    |     4     |     13    |     6     |     28    |     5     |   46   |
    |     7     |     17    |     7     |     89    |     6     |   21   |
    |     4     |     18    |     4     |     11    |     8     |    5   |
    |     9     |     19    |     7     |     21    |     5     |   30   |
    |     6     |     14    |     6     |     17    |     7     |   73   |
    

    I can slightly modify your code to play with this:

    import pandas as pd, numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import RobustScaler
    from sklearn.metrics import mean_squared_error 
    
    df = pd.read_clipboard()  # read the example table above from the clipboard
    
    # Build data
    y = df['Target'].to_numpy()
    scaled_y = df['Target'].values.reshape(-1, 1) #returns a numpy array
    df.drop('Target', inplace=True, axis=1)
    X = df.to_numpy()
    
    #scaling with RobustScaler
    scaler = RobustScaler()
    X = scaler.fit_transform(X)
    
    # Scaling y just to show you the difference (re-fit the scaler on y,
    # then flatten back to 1-D so sklearn does not warn about column vectors)
    scaled_y = scaler.fit_transform(scaled_y).ravel()
    
    # Set random_state so we can replicate results
    X_train , X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=8)
    scaled_X_train , scaled_X_test, scaled_y_train, scaled_y_test = train_test_split(X,scaled_y,test_size=0.2, random_state=8)
    
    mlp = MLPRegressor(activation='logistic')
    scaled_mlp = MLPRegressor(activation='logistic')
    
    mlp.fit(X_train, y_train)
    scaled_mlp.fit(scaled_X_train, scaled_y_train)
    
    preds = mlp.predict(X_test)
    scaled_preds = scaled_mlp.predict(scaled_X_test)  # use the model trained on scaled labels
    
    for pred, scaled_pred, tar, scaled_tar in zip(preds, scaled_preds, y_test, scaled_y_test):
        print("Regular MLP:")
        print("Prediction: {} | Actual: {} | Error: {}".format(pred, tar, tar-pred))
        
        print()
        print("MLP that was shown scaled labels: ")
        print("Prediction: {} | Actual: {} | Error: {}".format(scaled_pred, scaled_tar, scaled_tar-scaled_pred))
    

    In short, shrinking your target will naturally shrink your error, since your model is no longer learning the actual value but a value squeezed into a much smaller range.
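
    To make that concrete, here is a tiny illustration (the targets are the first four rows of the table above, the predictions are invented): dividing the target by 100 divides the squared error by 10,000, even though the predictions are no better in relative terms.

    from sklearn.metrics import mean_squared_error

    y_true = [53, 54, 96, 59]      # targets in their original units
    y_pred = [50, 60, 90, 65]      # invented, imperfect predictions

    print(mean_squared_error(y_true, y_pred))              # 29.25
    print(mean_squared_error([v / 100 for v in y_true],
                             [v / 100 for v in y_pred]))   # 0.002925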

    That is why we do not scale the target variable: the error only looks smaller because we have forced the values into a 0...1 space, not because the model predicts any better.
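
    On your second question: if you do keep scaling the target, sklearn's MinMaxScaler (like most of its scalers) provides inverse_transform, which maps scaled predictions back to the original units. A minimal, self-contained sketch with toy data (the random numbers and the y_scaler name are made up for this example):

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import MinMaxScaler

    # Toy data standing in for your features and vehicle counts
    X = np.random.rand(100, 3)
    y = np.random.randint(20, 200, size=(100, 1)).astype(float)

    # Fit a *separate* scaler on the target only, and keep it around
    y_scaler = MinMaxScaler()
    y_scaled = y_scaler.fit_transform(y)

    mlp = MLPRegressor(activation='logistic', max_iter=2000)
    mlp.fit(X, y_scaled.ravel())

    scaled_preds = mlp.predict(X[:3])                      # values roughly in [0, 1]
    original_preds = y_scaler.inverse_transform(scaled_preds.reshape(-1, 1))
    print(original_preds.ravel())                          # back to vehicle counts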