python pandas scikit-learn theano sklearn-pandas

StandardScaler Doesn't Scale Properly


I am trying to use StandardScaler to scale the features of a neural network.

Let's say the neural network has these features:

1.0  2.0   3.0
4.0  5.0   6.0
4.0  11.0  12.0
etc ...

When I apply StandardScaler to the whole thing (all rows) I get the following result for the first row:

['-0.920854068785', '-0.88080603151', '-0.571888559111']

When I apply StandardScaler to the first row only (a matrix consisting of just the first row), I get a completely different result:

['0.0', '0.0', '0.0']

Obviously the neural network won't work this way, because the same row is scaled to different values. Is there any way to use StandardScaler so that I get the same result each time for the same input row?

Here is the code and the output:

from sklearn.preprocessing import StandardScaler
import numpy as np
sc = StandardScaler()

#defining the (big) matrix
AR = np.array([[1.0,2.0,3.0],[4.0,5.0,6.0],[4.0,11.0,12.0],[42.0,131.0,1121.0],[41.0,111.0,121.0]])
AR = sc.fit_transform(AR)
print "fited data from big array:"
m=0
for row in AR: 
    m = m + 1
    if m==1:print [str(m) for m in row]

#defining the (small) matrix
AR1 = np.array([[1.0,2.0,3.0]])
AR1 = sc.fit_transform(AR1)
print "fited data from small array"
for row in AR1: 
     print [str(m) for m in row]

The output is:

fited data from big array:
['-0.920854068785', '-0.88080603151', '-0.571888559111']
fited data from small array
['0.0', '0.0', '0.0']

Solution

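  • StandardScaler shifts each column by its mean and scales it by its standard deviation. Because you pass only one row to it, the mean of each column is the value itself, so every value is shifted to zero. The standard deviation of a single value is zero, which scikit-learn replaces with 1 to avoid dividing by zero, as scale_ shows: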

    >>> import numpy as np
    >>> from sklearn.preprocessing import StandardScaler
    >>> sc = StandardScaler()
    >>> arr = np.array([[1.0,2.0,3.0]])
    >>> sc.fit(arr)
    
    >>> sc.mean_, sc.scale_
    (array([ 1.,  2.,  3.]), array([ 1.,  1.,  1.]))
    

    In your case, fit the scaler once on all the data, then call transform (not fit_transform) on each row to get the result.

    sc.fit(data)  # compute the mean and std of each column over all rows
    scaled_row = sc.transform(row.reshape(1, -1))  # reuse those statistics; row is a 1D array, transform expects 2D
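As a rough end-to-end sketch of the fit-once, transform-many pattern described above, the following reuses the matrix from the question. The names `data`, `scaler`, `first_row` and `manual_row` are chosen here for illustration only, and the manual check assumes the population standard deviation (NumPy's default `ddof=0`), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The "big" matrix from the question.
data = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [4.0, 11.0, 12.0],
    [42.0, 131.0, 1121.0],
    [41.0, 111.0, 121.0],
])

# Fit once: the per-column mean and std are learned from all rows.
scaler = StandardScaler()
scaler.fit(data)

# Transform everything, and separately transform only the first row
# without refitting. transform expects a 2D array, hence the reshape.
scaled_all = scaler.transform(data)
first_row = data[0].reshape(1, -1)
scaled_row = scaler.transform(first_row)

# The same computation by hand: (x - column mean) / column std.
manual_row = (first_row - data.mean(axis=0)) / data.std(axis=0)

print(scaled_row)
print(np.allclose(scaled_row, scaled_all[0]))  # True
print(np.allclose(scaled_row, manual_row))     # True
```

The single row now gets the same values as it did inside the full matrix, because `transform` only applies the stored statistics, whereas `fit_transform` recomputes them from whatever it is given.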