
How to normalize data in a python array by column using SKLearn?

I am coding a machine learning algorithm using Keras and I need to normalize my data before feeding it to the model. I have 3 inputs organised into a 2D array, with each column making up one input.

    import tensorflow as tf
    import keras
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense, Activation, Dropout
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import MinMaxScaler
    #Importing all the required modules

    raw_data = np.array([]) #Defining numpy array for training data
    val_data = np.array([]) #Defining numpy array for validation data
    test = np.array([]) #Defining numpy array for test data
    rawfilepath = r'C:\Users\***\Desktop\***\Unprocessed_Data_For_Training.txt'
    valfilepath = r'C:\Users\***\Desktop\***\Unprocessed_Data_For_Validation.txt'
    testfilepath = r'C:\Users\***\Desktop\***\h4t6usedforprediction.txt' #Filepaths 
    raw_data = np.loadtxt(rawfilepath)
    val_data = np.loadtxt(valfilepath)
    test = np.loadtxt(testfilepath) #Loading contents of text files into their respective arrays
    X = raw_data[:, 1:4] #Splitting the data, X contains the coordinate position, initial shear and initial  
    Y = raw_data[:, 0] #Splitting the data, Y contains the measured height
    X_Val = val_data[:, 1:4]
    Y_Val = val_data[:, 0]
    X_test = test[:, 1:4]
    Y_test = test[:, 0]
    scaler = MinMaxScaler()
    Xnorm = scaler.fit_transform(X) 
    Ynorm = scaler.fit_transform(Y.reshape(-1,1))
    Xvalnorm = scaler.fit_transform(X_Val)
    Yvalnorm = scaler.fit_transform(Y_Val.reshape(-1,1))
    Xtestnorm = scaler.fit_transform(X_test)
    Ytestnorm = scaler.fit_transform(Y_test.reshape(-1,1))

The Y variables are normalising fine; however, I think the X variables are being normalised across the whole array rather than column by column.

These are the inputs that the model is using to make predictions.

X=[0.94941569 0.         0.        ], Predicted=[0.02409407]
X=[0.95664225 0.         0.        ], Predicted=[0.02374389]
X=[0.93496738 0.         0.        ], Predicted=[0.02480936]
X=[0.94219233 0.         0.        ], Predicted=[0.02444912]
X=[0.92774402 0.         0.        ], Predicted=[0.02517468]
X=[0.92052067 0.         0.        ], Predicted=[0.02554525]
X=[0.91329892 0.         0.        ], Predicted=[0.02592104]
X=[0.90607877 0.         0.        ], Predicted=[0.02630214]
X=[0.89885863 0.         0.        ], Predicted=[0.02668868]
X=[0.89163848 0.         0.        ], Predicted=[0.02708073]
X=[0.88441994 0.         0.        ], Predicted=[0.0274783]
X=[0.87720299 0.         0.        ], Predicted=[0.02788144]


  • Let's take this part by part:

    1 - If X and Y are your training set, calling fit_transform on that set is correct. But you cannot call fit_transform again on your validation and test sets. You should only transform them, using the scaling already fitted on the training data:

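    # Use one scaler per array: refitting a single scaler on Y would overwrite what it learned from X
    x_scaler = MinMaxScaler()
    y_scaler = MinMaxScaler()
    Xnorm = x_scaler.fit_transform(X)
    Ynorm = y_scaler.fit_transform(Y.reshape(-1,1))
    Xvalnorm = x_scaler.transform(X_Val)
    Yvalnorm = y_scaler.transform(Y_Val.reshape(-1,1))
    Xtestnorm = x_scaler.transform(X_test)
    Ytestnorm = y_scaler.transform(Y_test.reshape(-1,1))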
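As a rough, self-contained sketch of the fit-on-train, transform-on-the-rest pattern from point 1: the arrays below are made up for illustration (they are not the asker's data), and the names `X_train`, `y_train`, `x_scaler` and `y_scaler` are hypothetical. It keeps one scaler for the inputs and one for the target, because a single MinMaxScaler that is fitted again on Y no longer holds the statistics it learned from X.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up stand-ins for the training and validation arrays (assumption)
X_train = np.array([[1.0, 10.0, 100.0],
                    [2.0, 20.0, 200.0],
                    [3.0, 30.0, 300.0]])
y_train = np.array([5.0, 10.0, 15.0])
X_val = np.array([[1.5, 15.0, 350.0]])
y_val = np.array([7.5])

# One scaler for the 3-column inputs, one for the single-column target
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()

# Learn per-column min and max from the training data only
X_train_norm = x_scaler.fit_transform(X_train)
y_train_norm = y_scaler.fit_transform(y_train.reshape(-1, 1))

# Reuse the training statistics for the validation data
X_val_norm = x_scaler.transform(X_val)
y_val_norm = y_scaler.transform(y_val.reshape(-1, 1))

print(X_val_norm)  # [[0.25 0.25 1.25]]
print(y_val_norm)  # [[0.25]]
```

Note that the third validation value comes out above 1: it lies outside the range seen in training, which is expected when the scaler is not refitted on the validation set.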

    2 - I am assuming the values of X you posted at the end are already the output of the normalization. So I have created my_X just to illustrate how to use sklearn to normalize some data:

    my_X = np.array([[-3, 2, 4], [-6, 4, 1], [0, 10, 15], [12, 18, 31]])
    scaler = MinMaxScaler()
    my_X_norm = scaler.fit_transform(my_X)  # each column is scaled to [0, 1] independently

    Just replace the values in my_X with the values you have in your X.
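To check the question's worry about whole-array versus per-column scaling, here is a small sketch that compares MinMaxScaler's output with a manual column-wise computation in NumPy. It reuses the example values of my_X from above; the names `scaled`, `col_min`, `col_max` and `manual` are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

my_X = np.array([[-3, 2, 4], [-6, 4, 1], [0, 10, 15], [12, 18, 31]], dtype=float)

# Scale with scikit-learn
scaled = MinMaxScaler().fit_transform(my_X)

# Manual per-column min-max scaling: (x - column min) / (column max - column min)
col_min = my_X.min(axis=0)
col_max = my_X.max(axis=0)
manual = (my_X - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))  # True: the scaler works column by column
print(scaled.min(axis=0), scaled.max(axis=0))  # [0. 0. 0.] [1. 1. 1.]
```

Each column ends up with its own minimum of 0 and maximum of 1, so the scaling is per column, not over the whole array. One possible reason for columns of zeros like those in the question is that a column which is constant within the array being fitted is mapped to all zeros, which can happen when a test set is refitted on its own.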