
Why is scaling the iris dataset making the MAE much worse?


This code predicts sepal length from the other iris features, and it gets an MAE of around 0.94:

from sklearn import metrics
from sklearn.neural_network import *
from sklearn.model_selection import *
from sklearn.preprocessing import *
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, 1:]
y = iris.data[:, 0]  # sepal length

X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = MLPRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(metrics.mean_absolute_error(y_test, y_pred))

Though when I remove the scaling lines

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

the MAE goes down to 0.33. Am I scaling wrong, and why does scaling make the error so much higher?


Solution

  • Interesting question. Let's first test a non-neural-net model (LinearRegression instead of sklearn.neural_network.MLPRegressor) with and without scaling, setting random states where appropriate so the results are reproducible:

    from sklearn import metrics
    from sklearn.neural_network import *
    from sklearn.model_selection import *
    from sklearn.preprocessing import *
    from sklearn import datasets
    import numpy as np
    from sklearn.linear_model import LinearRegression
    
    iris = datasets.load_iris()
    X = iris.data[:, 1:]
    y = iris.data[:, 0]  # sepal length
    
    
    ### put random state for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1989)
    
    
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    pred = lr.predict(X_test)
    
    # Evaluating Model's Performance
    print('Mean Absolute Error NO SCALE:', metrics.mean_absolute_error(y_test, pred))
    print('Mean Squared Error NO SCALE:', metrics.mean_squared_error(y_test, pred))
    print('Mean Root Squared Error NO SCALE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    
    ### put random state for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1989)
    
    
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    pred = lr.predict(X_test)
    
    # Evaluating Model's Performance
    print('Mean Absolute Error YES SCALE:', metrics.mean_absolute_error(y_test, pred))
    print('Mean Squared Error YES SCALE:', metrics.mean_squared_error(y_test, pred))
    print('Mean Root Squared Error YES SCALE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
    
    

    Gives:

    Mean Absolute Error NO SCALE: 0.2789437424421388
    Mean Squared Error NO SCALE: 0.1191038134603132
    Mean Root Squared Error NO SCALE: 0.3451142035041635
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Mean Absolute Error YES SCALE: 0.27894374244213865
    Mean Squared Error YES SCALE: 0.11910381346031311
    Mean Root Squared Error YES SCALE: 0.3451142035041634
    

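    OK. The two sets of metrics match to floating-point precision, which is expected: ordinary least squares is unaffected by standardizing the inputs, because the fitted coefficients and intercept simply absorb the rescaling. So it looks like you are doing everything right when it comes to scaling. Neural nets, however, have many nuances, and what works for one architecture may not work for another, so where possible, experiment to find the best approach.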


    Running your code also gives the following warning: _multilayer_perceptron.py:692: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (100) reached and the optimization hasn't converged yet. warnings.warn(

    So the optimizer has not converged, and that is why your MAE is high. Training proceeds in iterations and 100 were not enough, so max_iter must be increased for training to finish and the MAE to come down.
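
    As a minimal sketch of how to check this programmatically, the snippet below records any ConvergenceWarning raised during fitting and compares the fitted n_iter_ attribute with max_iter. The helper name fit_and_report and the two max_iter values (100 and 2000) are assumptions chosen for illustration, not values from the question.

```python
import warnings

from sklearn import datasets
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, 1:], iris.data[:, 0]  # predict sepal length
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

# Scale the inputs exactly as in the question.
X_train_scaled = StandardScaler().fit_transform(X_train)


def fit_and_report(max_iter):
    """Hypothetical helper: fit an MLPRegressor and report whether a
    ConvergenceWarning was raised, together with the iterations actually run."""
    model = MLPRegressor(random_state=100, max_iter=max_iter)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is not suppressed
        model.fit(X_train_scaled, y_train)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model.n_iter_, hit_limit


# 100 and 2000 are arbitrary illustration values (assumption).
for max_iter in (100, 2000):
    n_iter, hit_limit = fit_and_report(max_iter)
    print(f"max_iter={max_iter}: ran {n_iter} iterations, hit the limit: {hit_limit}")
```

    If the warning fires, n_iter_ equals max_iter, which is a sign the reported MAE comes from a half-trained network rather than from the scaling itself.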


    Additionally, because of the way the error is propagated back to the weights during training, a large spread in the targets can produce large gradients. These cause drastic changes in the weights, which can make training unstable or keep it from converging at all.

    Overall, NNs TEND to perform best when inputs are on a common scale, and they TEND to train faster (fewer iterations, controlled by the max_iter parameter here; see below). We will check that next...

    On top of that, the type of transform may matter too: standardization vs. normalization, and the variants within each. For example, in RNNs scaling to [-1, 1] TENDS to perform better than scaling to [0, 1].
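
    To make the distinction concrete, here is a small sketch comparing standardization with the two normalization ranges mentioned above on the same iris features. The variable names are illustrative; StandardScaler and MinMaxScaler (with its feature_range parameter) are the scikit-learn transformers assumed here.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = datasets.load_iris().data[:, 1:]  # the three input features used above

# Standardization: each column gets zero mean and unit variance.
standardized = StandardScaler().fit_transform(X)

# Normalization: each column is mapped linearly onto a fixed range.
unit_range = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
symmetric_range = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

for name, arr in [
    ("standardized", standardized),
    ("min-max [0, 1]", unit_range),
    ("min-max [-1, 1]", symmetric_range),
]:
    print(name)
    print("  column min :", np.round(arr.min(axis=0), 2))
    print("  column max :", np.round(arr.max(axis=0), 2))
    print("  column mean:", np.round(arr.mean(axis=0), 2))
```

    In a real experiment the scaler would be fitted on the training split only, as in the question; it is fitted on the full array here just to show the resulting ranges.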


    Let's run the MLPRegressor experiments next:

    ### DO IMPORTS
    from sklearn import metrics
    from sklearn.neural_network import *
    from sklearn.model_selection import *
    from sklearn.preprocessing import *
    from sklearn import datasets
    import numpy as np
    
    ### GET DATASET
    iris = datasets.load_iris()
    X = iris.data[:, 1:]
    y = iris.data[:, 0]  # sepal length
    

    #########################################################################################
    # SCALE INPUTS = NO
    # SCALE TARGETS = NO
    #########################################################################################
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)
    
    
    # set random_state here as well, because an NN's initial parameters are randomized
    # max_iter for each run was found manually, but you can also use grid search because it's basically a hyperparameter
    
    model = MLPRegressor(random_state = 100,max_iter=450)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('----------------------------------------------------------------------')
    print("SCALE INPUTS =  NO & SCALE TARGETS = NO")
    print('----------------------------------------------------------------------')
    print('Mean Absolute Error', metrics.mean_absolute_error(y_test,  y_pred))
    print('Squared Error', metrics.mean_squared_error(y_test,  y_pred))
    print('Mean Root Squared Error', np.sqrt(metrics.mean_squared_error(y_test,  y_pred)))
    
    ----------------------------------------------------------------------
    SCALE INPUTS =  NO & SCALE TARGETS = NO
    ----------------------------------------------------------------------
    Mean Absolute Error 0.25815648734192126
    Squared Error 0.10196864342576142
    Mean Root Squared Error 0.319325294058835
    

    #########################################################################################
    # SCALE INPUTS = YES
    # SCALE TARGETS = NO
    #########################################################################################
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)
    
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    model = MLPRegressor(random_state = 100,max_iter=900)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('----------------------------------------------------------------------')
    print("SCALE INPUTS = YES & SCALE TARGETS = NO")
    print('----------------------------------------------------------------------')
    print('Mean Absolute Error', metrics.mean_absolute_error(y_test,  y_pred))
    print('Squared Error', metrics.mean_squared_error(y_test,  y_pred))
    print('Mean Root Squared Error', np.sqrt(metrics.mean_squared_error(y_test,  y_pred)))
    
    ----------------------------------------------------------------------
    SCALE INPUTS = YES & SCALE TARGETS = NO
    ----------------------------------------------------------------------
    Mean Absolute Error 0.2699225498998305
    Squared Error 0.1221046275841224
    Mean Root Squared Error 0.3494347257845482
    

    #########################################################################################
    # SCALE INPUTS = NO
    # SCALE TARGETS = YES
    #########################################################################################
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)
    
    scaler_y = StandardScaler()
    y_train = scaler_y.fit_transform(y_train.reshape(-1, 1))
    
    ### NO NEED TO SCALE y_test since the network doesn't see it
    # y_test = scaler_y.transform(y_test.reshape(-1, 1))
    
    model = MLPRegressor(random_state = 100,max_iter=500)
    model.fit(X_train, y_train.ravel())
    y_pred = model.predict(X_test)
    
    ### rescale predictions back to y_test scale
    y_pred_rescaled_back = scaler_y.inverse_transform(y_pred.reshape(-1, 1))
    
    print('----------------------------------------------------------------------')
    print("SCALE INPUTS = NO & SCALE TARGETS = YES")
    print('----------------------------------------------------------------------')
    print('Mean Absolute Error', metrics.mean_absolute_error(y_test,  y_pred_rescaled_back))
    print('Squared Error', metrics.mean_squared_error(y_test,  y_pred_rescaled_back))
    print('Mean Root Squared Error', np.sqrt(metrics.mean_squared_error(y_test,  y_pred_rescaled_back)))
    
    ----------------------------------------------------------------------
    SCALE INPUTS = NO & SCALE TARGETS = YES
    ----------------------------------------------------------------------
    Mean Absolute Error 0.23602139631237182
    Squared Error 0.08762790909543768
    Mean Root Squared Error 0.29602011603172795
    

    #########################################################################################
    # SCALE INPUTS = YES
    # SCALE TARGETS = YES
    #########################################################################################
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)
    
    scaler_x = StandardScaler()
    scaler_y = StandardScaler()
    
    X_train = scaler_x.fit_transform(X_train)
    X_test = scaler_x.transform(X_test)
    
    y_train = scaler_y.fit_transform(y_train.reshape(-1, 1))
    ### NO NEED TO SCALE y_test since the network doesn't see it
    # y_test = scaler_y.transform(y_test.reshape(-1, 1))
    
    model = MLPRegressor(random_state = 100,max_iter=250)
    model.fit(X_train, y_train.ravel())
    y_pred = model.predict(X_test)
    
    ### rescale predictions back to y_test scale
    y_pred_rescaled_back = scaler_y.inverse_transform(y_pred.reshape(-1, 1))
    
    print('----------------------------------------------------------------------')
    print("SCALE INPUTS = YES & SCALE TARGETS = YES")
    print('----------------------------------------------------------------------')
    print('Mean Absolute Error', metrics.mean_absolute_error(y_test,  y_pred_rescaled_back))
    print('Squared Error', metrics.mean_squared_error(y_test,  y_pred_rescaled_back))
    print('Mean Root Squared Error', np.sqrt(metrics.mean_squared_error(y_test,  y_pred_rescaled_back)))
    
    ----------------------------------------------------------------------
    SCALE INPUTS = YES & SCALE TARGETS = YES
    ----------------------------------------------------------------------
    Mean Absolute Error 0.2423901612747137
    Squared Error 0.09758236232324796
    Mean Root Squared Error 0.3123817573470768
    

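    To summarize (max_iter and MAE taken from the four runs above):

    Scale inputs | Scale targets | max_iter | MAE
    NO           | NO            | 450      | 0.2582
    YES          | NO            | 900      | 0.2699
    NO           | YES           | 500      | 0.2360
    YES          | YES           | 250      | 0.2424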

    So it looks like, with this particular way of scaling, this particular architecture, and this dataset, you converge the fastest with scaled inputs and scaled targets (max_iter=250). The MAE in that run is slightly higher than when you don't scale inputs but do scale targets, for example. That gap is small and comes from a single split and seed, so it may be noise rather than an effect of the transform: standardization is invertible and discards no information.
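
    One way to judge whether differences of this size mean anything is to repeat the comparison over several seeds. The following is a rough sketch under stated assumptions: the helper mae_for_seed, the five seeds, and max_iter=2000 are all illustrative choices, and only input scaling is varied.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, 1:], iris.data[:, 0]  # predict sepal length


def mae_for_seed(seed, scale_inputs):
    """Hypothetical helper: test MAE for one split/initialization seed."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    steps = [StandardScaler()] if scale_inputs else []
    # max_iter=2000 is an assumption meant to give the optimizer room to converge.
    model = make_pipeline(*steps, MLPRegressor(random_state=seed, max_iter=2000))
    model.fit(X_train, y_train)
    return metrics.mean_absolute_error(y_test, model.predict(X_test))


seeds = range(5)  # arbitrary number of repetitions
for scale_inputs in (False, True):
    maes = np.array([mae_for_seed(seed, scale_inputs) for seed in seeds])
    print(f"scale inputs = {scale_inputs}: MAE mean {maes.mean():.3f}, std {maes.std():.3f}")
```

    If the spread across seeds is comparable to the gap between configurations, the ranking in a single run should not be read too literally.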


    Even here, however, I think that changing the learning rate hyperparameter (learning_rate_init in MLPRegressor), for example, could help training converge faster when the values are not scaled, but that would need experimenting with as well. As you can see, many nuances indeed.
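
    A hedged sketch of such an experiment: rather than tuning by hand, GridSearchCV can try several learning_rate_init values with cross-validation. The grid values, cv=5, and max_iter=2000 below are assumptions picked for illustration, and the step names "scale" and "mlp" are arbitrary labels.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, 1:], iris.data[:, 0]  # predict sepal length
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

# Keeping the scaler inside the pipeline means it is refitted on each CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPRegressor(random_state=100, max_iter=2000)),  # max_iter is an assumption
])

# Illustrative grid (assumption); 0.001 is MLPRegressor's default learning_rate_init.
param_grid = {"mlp__learning_rate_init": [0.001, 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("cross-validated MAE:", -search.best_score_)
print("iterations used by best model:", search.best_estimator_.named_steps["mlp"].n_iter_)
```

    The same grid could include max_iter or the scaler itself if you want to compare the four scaling setups in one search.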


    PS Some good discussions on this topic