python machine-learning scikit-learn normalization sklearn-pandas

How to normalize the Train and Test data using MinMaxScaler sklearn


I have a question that I have not been able to find an answer to. Suppose I scale my data like this:

import pandas as pd
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()

df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})

df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)

After that I train and test the model (A and B as features, C as the label) and get some accuracy score. My question is what happens when I have to predict the label for a new set of data, say:

data = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})

If I normalize these columns the same way, the values of A and B are scaled according to the new data, not the data the model was trained on. After the data preparation step below,

data[['A','B']] = min_max_scaler.fit_transform(data[['A','B']])

the values of A and B are scaled with respect to the min and max of data[['A','B']], whereas the training data was scaled with respect to the min and max of df[['A','B']].

How can the data preparation be valid when the two sets are scaled against different numbers? I don't understand how the prediction can be correct here.


Solution

  • You should fit the MinMaxScaler on the training data only, and then apply the fitted scaler to the testing data (transform, without refitting) before the prediction.


    In summary:

    • Step 1: fit the scaler on the TRAINING data
    • Step 2: use the scaler to transform the TRAINING data
    • Step 3: use the transformed training data to fit the predictive model
    • Step 4: use the scaler to transform the TEST data
    • Step 5: predict using the trained model (step 3) and the transformed TEST data (step 4).
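
To see why the scaler must not be refit in step 4, here is a minimal sketch using the numbers from the question. The variable names (train, new, scaled_with_train_stats, scaled_with_refit) are only illustrative. It compares transforming the new data with the scaler fitted on the training data against fitting a fresh scaler on the new data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Training features from the question (hypothetical variable names)
train = pd.DataFrame({
    'A': [1, 2, 3, 7, 9, 15, 16, 1, 5, 6, 2, 4, 8, 9],
    'B': [15, 12, 10, 11, 8, 14, 17, 20, 4, 12, 4, 5, 17, 19],
})
# New data to predict on
new = pd.DataFrame({'A': [25, 67, 24, 76, 23], 'B': [2, 54, 22, 75, 19]})

# Correct: the scaler learns min/max from the training data only
scaler = MinMaxScaler().fit(train)
scaled_with_train_stats = scaler.transform(new)

# Incorrect for prediction: a fresh scaler learns min/max from the new data
scaled_with_refit = MinMaxScaler().fit_transform(new)

print(scaler.data_min_, scaler.data_max_)  # training min and max per column
print(scaled_with_train_stats[0])  # A=25 -> (25-1)/(16-1) = 1.6, B=2 -> (2-4)/(20-4) = -0.125
print(scaled_with_refit[0])        # A=25 -> (25-23)/(76-23), about 0.038, B=2 -> 0.0
```

The same raw value A=25 ends up as about 1.6 in one case and about 0.038 in the other. The model only ever saw features on the training scale, so only the first version means the same thing to it.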

    Example using your data:

    import pandas as pd
    from sklearn import preprocessing
    from sklearn.svm import SVC

    min_max_scaler = preprocessing.MinMaxScaler()

    # training data
    df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})

    # fit the scaler on the training data, transform it, and use the result for model training
    df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
    df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)

    # fit the model (SVC is only a placeholder here; any scikit-learn estimator is used the same way)
    model = SVC()
    model.fit(df[['A','B']], df['C'])

    # after training on the transformed training data, define the testing data df_test
    df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})

    # before predicting on the test data, ONLY APPLY the already fitted scaler (transform, not fit_transform)
    df_test[['A','B']] = min_max_scaler.transform(df_test[['A','B']])

    # predict with the trained model
    y_predicted_from_model = model.predict(df_test[['A','B']])
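
Note that the test values here lie outside the range seen during training, so after the transform they fall outside [0, 1]. That is expected and not a bug. If bounded features are needed, MinMaxScaler accepts a clip parameter (available in scikit-learn 0.24 and later, as far as I know). A rough sketch, with illustrative variable names, and whether clipping is appropriate depends on the model:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same training and test features as in the example above
train = pd.DataFrame({
    'A': [1, 2, 3, 7, 9, 15, 16, 1, 5, 6, 2, 4, 8, 9],
    'B': [15, 12, 10, 11, 8, 14, 17, 20, 4, 12, 4, 5, 17, 19],
})
test = pd.DataFrame({'A': [25, 67, 24, 76, 23], 'B': [2, 54, 22, 75, 19]})

# Default behaviour: out-of-range test values map outside [0, 1]
plain = MinMaxScaler().fit(train).transform(test)

# clip=True caps the transformed values at the feature range (0 to 1 by default)
clipped = MinMaxScaler(clip=True).fit(train).transform(test)

print(plain.min(), plain.max())      # below 0 and above 1
print(clipped.min(), clipped.max())  # within [0, 1]
```

Clipping throws away how far outside the training range a value was, so it is worth checking first whether the new data really comes from the same distribution as the training data.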
    

    Example using iris data:

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC
    
    data = datasets.load_iris()
    X = data.data
    y = data.target
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    
    model = SVC()
    model.fit(X_train_scaled, y_train)
    
    X_test_scaled = scaler.transform(X_test)
    y_pred = model.predict(X_test_scaled)
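
The same fit-on-train, transform-on-test pattern can also be bundled with scikit-learn's make_pipeline, so the scaler cannot accidentally be refit on the test data. This is a sketch of an alternative to the manual steps, not something the example above requires; the name pipe is just illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The pipeline chains the scaler and the model into one estimator
pipe = make_pipeline(MinMaxScaler(), SVC())

# fit() fits the scaler on X_train, transforms X_train, then fits the SVC
pipe.fit(X_train, y_train)

# predict() only transforms X_test with the fitted scaler, then predicts
y_pred = pipe.predict(X_test)
print(pipe.score(X_test, y_test))  # accuracy on the held-out test set
```

A pipeline also helps with cross-validation, because the scaler is then refit on the training part of every fold.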
    

    Hope this helps.

    See also my post here: https://towardsdatascience.com/everything-you-need-to-know-about-min-max-normalization-in-python-b79592732b79