Tags: python, dataframe, normalization, mean, standard-deviation

Normalizing the columns of a dataframe


I want to normalize the columns of the following dataframe:

import pandas as pd
from pprint import pprint
d = {'A': [1,0,3,0], 'B':[2,0,1,0], 'C':[0,0,8,0], 'D':[1,0,0,1]}
df = pd.DataFrame(data=d)
df = (df - df.mean())/df.std()

I am not sure if the normalization is done row-wise or column-wise.

I intend to do (x - mean of elements in the column)/ standard deviation, for each column.

Do I also need to divide the standard deviation by the number of entries in each column?


Solution

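  • Your code normalizes column-wise, and it works correctly: df.mean() and df.std() return one value per column, and pandas aligns those values with the column labels when subtracting and dividing. You do not need to divide the standard deviation by the number of entries, because df.std() already computes the sample standard deviation, which divides by N - 1 internally. In case it is useful, here are some other types of normalization you might need: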

    Standardization, also called z-score normalization (what your code does):

    normalized_df=(df-df.mean())/df.std()
              A         B    C         D
    0  0.000000  1.305582 -0.5  0.866025
    1 -0.707107 -0.783349 -0.5 -0.866025
    2  1.414214  0.261116  1.5 -0.866025
    3 -0.707107 -0.783349 -0.5  0.866025
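
    As a quick sanity check of the column-wise behaviour, the sketch below spells out the defaults explicitly (axis=0, ddof=1) and compares df.std() with a hand-written formula for one column. The names col_mean, col_std and manual_std_a are only illustrative and do not appear in the original code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 3, 0], 'B': [2, 0, 1, 0],
                   'C': [0, 0, 8, 0], 'D': [1, 0, 0, 1]})

# axis=0 (the default) reduces down the rows, giving one value per column.
col_mean = df.mean(axis=0)
col_std = df.std(axis=0, ddof=1)  # sample std: divides by N - 1 internally

# DataFrame - Series aligns the Series index with the column labels,
# so every column is shifted and scaled by its own statistics.
normalized_df = (df - col_mean) / col_std

# After normalization each column should have mean 0 and sample std 1.
print(normalized_df.mean().round(6))
print(normalized_df.std().round(6))

# Hand-written sample standard deviation for column 'A', for comparison.
n = len(df)
manual_std_a = np.sqrt(((df['A'] - df['A'].mean()) ** 2).sum() / (n - 1))
print(manual_std_a, df['A'].std())
```

    If the two printed standard deviations agree, no further division by the number of entries is needed.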
    

    Min-Max normalization:

    normalized_df=(df-df.min())/(df.max()-df.min())
              A    B    C    D
    0  0.333333  1.0  0.0  1.0
    1  0.000000  0.0  0.0  0.0
    2  1.000000  0.5  1.0  0.0
    3  0.000000  0.0  0.0  1.0
    

    sklearn.preprocessing also provides many ready-made scalers (among other tools), such as StandardScaler, MinMaxScaler or MaxAbsScaler:

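    Standardization using sklearn's StandardScaler (the values differ from the pandas result above because StandardScaler divides by the population standard deviation, ddof=0, whereas pandas' std() defaults to the sample standard deviation, ddof=1):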

    import pandas as pd
    from sklearn import preprocessing
    
    mean_scaler = preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
    x_scaled = mean_scaler.fit_transform(df.values)
    normalized_df = pd.DataFrame(x_scaled)
    
              0         1         2    3
    0  0.000000  1.507557 -0.577350  1.0
    1 -0.816497 -0.904534 -0.577350 -1.0
    2  1.632993  0.301511  1.732051 -1.0
    3 -0.816497 -0.904534 -0.577350  1.0
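
    To see where the difference from the plain pandas version comes from, here is a small sketch that reproduces the numbers above without sklearn by asking pandas for the population standard deviation. The name population_scaled is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 3, 0], 'B': [2, 0, 1, 0],
                   'C': [0, 0, 8, 0], 'D': [1, 0, 0, 1]})

# ddof=0 gives the population standard deviation (divides by N),
# instead of the pandas default ddof=1 (divides by N - 1).
population_scaled = (df - df.mean()) / df.std(ddof=0)
print(population_scaled)
```

    Which of the two conventions to use depends on the application; for most scaling purposes the difference is just a constant factor per column.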
    

    Min-Max normalization using sklearn MinMaxScaler:

    import pandas as pd
    from sklearn import preprocessing
    
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(df.values)
    normalized_df = pd.DataFrame(x_scaled)
    
              0    1    2    3
    0  0.333333  1.0  0.0  1.0
    1  0.000000  0.0  0.0  0.0
    2  1.000000  0.5  1.0  0.0
    3  0.000000  0.0  0.0  1.0
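
    Note that the sklearn results above have columns labelled 0 to 3, because fit_transform returns a plain NumPy array. One possible way to keep the original labels is to pass them back when rebuilding the dataframe, as in this sketch (labeled_df is an illustrative name):

```python
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'A': [1, 0, 3, 0], 'B': [2, 0, 1, 0],
                   'C': [0, 0, 8, 0], 'D': [1, 0, 0, 1]})

min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)

# Reattach the original column names and row index to the scaled array.
labeled_df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
print(labeled_df)
```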
    

    I hope I have helped you!