python · scikit-learn · data-preprocessing

The values calculated by RobustScaler in sklearn seem not right


I tried the RobustScaler in sklearn, and found that the results do not match the formula.

The formula of the RobustScaler in sklearn, i.e. x_scaled = (x - median) / IQR with IQR = Q3 - Q1, is shown below:

Figure 1. The formula to calculate RobustScaler

I have the matrix shown below:

Figure 2. The test matrix

I tested the first value in feature one (row one, column one). The scaled value should be (1 - 3)/(5.5 - 1.5) = -0.5. However, the result from sklearn is -0.67. Does anyone know where the calculation goes wrong?

The code using sklearn is as below:

import numpy as np
from sklearn.preprocessing import RobustScaler

x = [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [2, 1, 1, 1]]
scaler = RobustScaler(quantile_range=(25.0, 75.0), with_centering=True)
x_new = scaler.fit_transform(x)
print(x_new)

Solution

  • From the RobustScaler documentation (emphasis added):

    Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.

    i.e. the median and IQR quantities are calculated per column, and not for the whole array.

    Having clarified that, let's calculate the scaled values for your first column manually:

    import numpy as np
    
    x1 = np.array([1, 4, 7, 2])  # your 1st column here
    
    q75, q25 = np.percentile(x1, [75, 25])
    iqr = q75 - q25
    
    x1_med = np.median(x1)
    
    x1_scaled = (x1 - x1_med) / iqr
    x1_scaled
    # array([-0.66666667,  0.33333333,  1.33333333, -0.33333333])
    

    which is the same as the first column of your own x_new, as calculated by scikit-learn:

    # your code verbatim:
    from sklearn.preprocessing import RobustScaler
    x = [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [2, 1, 1, 1]]
    scaler = RobustScaler(quantile_range=(25.0, 75.0), with_centering=True)
    x_new = scaler.fit_transform(x)
    print(x_new)
    # result
    [[-0.66666667 -0.375      -0.35294118 -0.33333333]
     [ 0.33333333  0.375       0.35294118  0.33333333]
     [ 1.33333333  1.125       1.05882353  1.        ]
     [-0.33333333 -0.625      -0.82352941 -1.        ]]
    
    np.all(x1_scaled == x_new[:, 0])
    # True
    

    Similarly for the rest of the columns (features): you need to calculate the median and IQR values separately for each one of them before scaling.
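    Instead of looping over columns, the same per-feature computation can be vectorized with NumPy's axis argument; here is a short sketch (not part of the original answer, but it reproduces the x_new matrix above):

    ```python
    import numpy as np

    x = np.array([[1, 2, 3, 4],
                  [4, 5, 6, 7],
                  [7, 8, 9, 10],
                  [2, 1, 1, 1]])

    med = np.median(x, axis=0)                     # per-column medians
    q75, q25 = np.percentile(x, [75, 25], axis=0)  # per-column quartiles
    x_scaled = (x - med) / (q75 - q25)             # broadcasts row-wise

    print(x_scaled)
    # [[-0.66666667 -0.375      -0.35294118 -0.33333333]
    #  [ 0.33333333  0.375       0.35294118  0.33333333]
    #  [ 1.33333333  1.125       1.05882353  1.        ]
    #  [-0.33333333 -0.625      -0.82352941 -1.        ]]
    ```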

    UPDATE (after comment):

    As pointed out in the Wikipedia entry on quartiles:

    For discrete distributions, there is no universal agreement on selecting the quartile values

    See also the relevant reference, Sample quantiles in statistical packages:

    There are a large number of different definitions used for sample quantiles in statistical computer packages

    Digging into the documentation of np.percentile used here, you'll see that there are no less than five (5) different methods of interpolation, and not all of them produce identical results (see also the 4 different methods demonstrated in the Wikipedia entry linked just above). Note that in NumPy >= 1.22 the `interpolation` argument has been renamed to `method`. Here is a quick demonstration of these methods and their results on the x1 data defined above:

    np.percentile(x1, [75, 25])  # interpolation='linear' by default
    # array([4.75, 1.75])
    
    np.percentile(x1, [75, 25], interpolation='lower')
    # array([4, 1])
    
    np.percentile(x1, [75, 25], interpolation='higher')
    # array([7, 2])
    
    np.percentile(x1, [75, 25], interpolation='midpoint')
    # array([5.5, 1.5])
    
    np.percentile(x1, [75, 25], interpolation='nearest')
    # array([4, 2])
    

    Apart from the fact that no two of these methods produce identical results, it should also be apparent that the definition you are using in your own calculations corresponds to interpolation='midpoint', while the default NumPy method is interpolation='linear'. And as Ben Reiniger correctly points out in the comments below, what is actually used in the source code of RobustScaler is np.nanpercentile (a variation of np.percentile used here that is able to handle nan values) with the default interpolation='linear' setting.
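    To close the loop, a quick check (assuming NumPy >= 1.22, where the keyword is `method`; on older versions use `interpolation=`) that the 'midpoint' rule reproduces the -0.5 expected in the question:

    ```python
    import numpy as np

    x1 = np.array([1, 4, 7, 2])
    med = np.median(x1)  # 3.0

    # 'midpoint' yields the quartiles the question used: Q3 = 5.5, Q1 = 1.5
    q75, q25 = np.percentile(x1, [75, 25], method='midpoint')

    print((x1[0] - med) / (q75 - q25))  # (1 - 3) / (5.5 - 1.5) = -0.5
    ```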