I'm using `preprocessing` from the `sklearn` package to normalize data as follows:
```python
import pandas as pd
from sklearn import preprocessing

# Read the tab-separated decathlon data directly from GitHub.
decathlon = pd.read_csv(
    "https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/decathlon.txt",
    sep='\t',
)
decathlon.describe()

# Standardize the ten event columns in a copy of the frame.
nor_df = decathlon.copy()
nor_df.iloc[:, 0:10] = preprocessing.scale(decathlon.iloc[:, 0:10])
nor_df.describe()
```
The result shows a mean of `-1.516402e-16`, which is essentially 0. The standard deviation, however, is `1.012423e+00`, i.e. `1.012423`, which I would not consider close to 1. Could you please explain why this happens?
In this instance, `sklearn` and `pandas` calculate the standard deviation differently.
From the `sklearn.preprocessing.scale` documentation:

> We use a biased estimator for the standard deviation, equivalent to `numpy.std(x, ddof=0)`. Note that the choice of `ddof` is unlikely to affect model performance.
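To see what that convention means in practice, here is a minimal sketch on a toy column (not the decathlon data); the array `x` is made up purely for illustration:

```python
import numpy as np
from sklearn import preprocessing

x = np.array([[1.0], [2.0], [3.0], [4.0]])   # toy column with N = 4 values
scaled = preprocessing.scale(x)

print(np.std(scaled, ddof=0))   # 1.0      -- the biased std that scale() targets
print(np.std(scaled, ddof=1))   # ~1.1547  -- sqrt(4/3) times larger
```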
`pandas.DataFrame.describe` uses `pandas.core.series.Series.std`, whose documentation says:

> Normalized by N-1 by default. This can be changed using the ddof argument
> ...
> ddof : int, default 1
> Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
Note that, as of 2020-10-28, `pandas.DataFrame.describe` does not accept a `ddof` parameter, so the default of `ddof=1` is always used for the underlying `Series`.
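Putting the two together explains the exact value in the question: the sample standard deviation (`ddof=1`) exceeds the population standard deviation (`ddof=0`) by a factor of sqrt(N / (N - 1)). The decathlon data has N = 41 rows, and sqrt(41/40) ≈ 1.012423, which is precisely the `std` that `describe()` reports for the scaled columns. A quick check, reusing the `nor_df` built in the question (the row count is read from the frame rather than hard-coded):

```python
import numpy as np

N = len(nor_df)                  # 41 rows for this dataset
print(np.sqrt(N / (N - 1)))      # ~1.012423 -- ratio of the ddof=1 std to the ddof=0 std

col = nor_df.iloc[:, 0]          # any one of the scaled event columns
print(np.std(col, ddof=0))       # ~1.0      -- what sklearn standardized to
print(col.std())                 # ~1.012423 -- pandas' default estimate, ddof=1
```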