Can anyone explain the math behind the scenes? Why do Python and R return different results? Which one should I use for a real-world business scenario?
Original data:
id cost sales item
1 300 50 pen
2 3 88 wf
3 1 70 gher
4 5 80 dger
5 2 999 ww
Python code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('Scale.csv')
df[['cost', 'sales']] = StandardScaler().fit_transform(df[['cost', 'sales']])
df
Python normalized result
id cost sales item
0 1 1.999876 -0.559003 pen
1 2 -0.497867 -0.456582 wf
2 3 -0.514686 -0.505097 gher
3 4 -0.481047 -0.478144 dger
4 5 -0.506276 1.998826 ww
R code:
library(readr)
library(dplyr)
df <- read_csv("C:/Users/Ho/Desktop/Scale.csv")
df <- df %>% mutate_each_(funs(scale(.) %>% as.vector),
vars=c("cost","sales"))
R normalized result
id cost sales item
1 1 1.7887437 -0.4999873 pen
2 2 -0.4453054 -0.4083792 wf
3 3 -0.4603495 -0.4517725 gher
4 4 -0.4302613 -0.4276651 dger
5 5 -0.4528275 1.7878041 ww
thanks @Wen
I don't use those functions in Python much, but the data imply that the difference is in the denominator of the variance used to standardize: the Python function uses n, whereas R uses n-1. We can convert between the two by multiplying by sqrt(n/(n-1)). The following shows that after multiplying by sqrt(5/4), the data from R match the Python values.
> tab <- read.table(textConnection("1 1 1.7887437 -0.4999873 pen
+ 2 2 -0.4453054 -0.4083792 wf
+ 3 3 -0.4603495 -0.4517725 gher
+ 4 4 -0.4302613 -0.4276651 dger
+ 5 5 -0.4528275 1.7878041 ww"))
> tab
V1 V2 V3 V4 V5
1 1 1 1.78874369999999994 -0.49998730000000002 pen
2 2 2 -0.44530540000000002 -0.40837920000000000 wf
3 3 3 -0.46034950000000002 -0.45177250000000002 gher
4 4 4 -0.43026130000000001 -0.42766510000000002 dger
5 5 5 -0.45282749999999999 1.78780410000000001 ww
> # To transform as if we used n in the denominator instead of
> # n-1 we just multiply by sqrt(n/(n-1))
> tab$V3 * sqrt(5/4)
[1] 1.99987625376224520 -0.49786657257386746 -0.51468638770401975
[4] -0.48104675744371517 -0.50627653604064304
> tab$V4 * sqrt(5/4)
[1] -0.55900279534329034 -0.45658182589849106 -0.50509701018251196
[4] -0.47814411760212272 1.99882574902641608
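The same check can be sketched from the Python side using only NumPy. This is a minimal illustration rather than what either library does internally: the helper name `standardize` is made up for this example, and the `cost` values are typed in from the question. The scikit-learn documentation describes StandardScaler as using the biased standard deviation (equivalent to ddof=0), while R's scale() with centering divides by the sample standard deviation (ddof=1).

```python
import numpy as np

# 'cost' column from the question, typed in by hand
cost = np.array([300, 3, 1, 5, 2], dtype=float)

def standardize(x, ddof):
    """Center x and divide by its standard deviation.

    ddof=0 divides the sum of squares by n (population SD);
    ddof=1 divides by n-1 (sample SD).
    """
    return (x - x.mean()) / x.std(ddof=ddof)

z_n = standardize(cost, ddof=0)          # denominator n, matches the Python output
z_n_minus_1 = standardize(cost, ddof=1)  # denominator n-1, matches the R output

n = len(cost)
print(z_n)
print(z_n_minus_1)
# Rescaling the n-1 version by sqrt(n/(n-1)) recovers the n version
print(z_n_minus_1 * np.sqrt(n / (n - 1)))
```

The two results differ only by the constant factor sqrt(n/(n-1)), so the ordering and relative spacing of the standardized values are identical, and the gap shrinks as n grows.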