Tags: python-3.x, scikit-learn, scientific-notation

Why isn't the standard deviation of data normalized by sklearn equal to 1?


I'm using the preprocessing module from the sklearn package to normalize data as follows:

import pandas as pd
from sklearn import preprocessing

# Load the decathlon dataset (tab-separated file).
decathlon = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/decathlon.txt", sep='\t')
decathlon.describe()

# Standardize the first 10 numeric columns to zero mean and unit variance.
nor_df = decathlon.copy()
nor_df.iloc[:, 0:10] = preprocessing.scale(decathlon.iloc[:, 0:10])
nor_df.describe()

The result is

[Image: nor_df.describe() output showing the mean and std rows for the scaled columns]

The mean is -1.516402e-16, which is essentially 0. On the other hand, the standard deviation (the std row reported by describe()) is 1.012423e+00, i.e. 1.012423, which I would not consider close to 1.

Could you please elaborate on this phenomenon?


Solution

  • In this instance, sklearn and pandas calculate the standard deviation differently; see the sketch below the quoted documentation.

    sklearn.preprocessing.scale:

    We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

    pandas.DataFrame.describe uses pandas.core.series.Series.std, where:

    Normalized by N-1 by default. This can be changed using the ddof argument

    ...

    ddof : int, default 1 Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

    It should be noted that, as of 2020-10-28, pandas.DataFrame.describe does not expose a ddof parameter, so the default of ddof=1 is always used for Series.
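
    A minimal numeric sketch of the effect follows. It uses synthetic data rather than the decathlon file (the random values and seed are assumptions), with 41 rows because the reported 1.012423 equals sqrt(41/40):

    import numpy as np
    import pandas as pd
    from sklearn import preprocessing

    rng = np.random.default_rng(0)
    x = rng.normal(size=41)                # any numeric column behaves the same way

    scaled = preprocessing.scale(x)        # standardizes using ddof=0

    print(np.std(scaled, ddof=0))          # ~1.0       -- sklearn's convention
    print(pd.Series(scaled).std())         # ~1.012423  -- pandas' default, ddof=1
    print(np.sqrt(41 / 40))                # ~1.012423  -- sqrt(N / (N - 1))

    In general, a column scaled to unit standard deviation with ddof=0 will show a standard deviation of sqrt(N / (N - 1)) when recomputed with ddof=1; this tends to 1 as N grows, so the discrepancy is a reporting convention rather than a bug.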