
scikit-learn scaler and inversion do not yield identical numbers?


I'm fairly new to scikit-learn and am seeing behavior that looks strange with min-max normalization and inversion (though maybe it is intended). I expected inverse_transform to regenerate the original data points, but it produces numbers that are close to, yet not exactly equal to, the originals. Is this supposed to happen, or have I made a mistake?

import sklearn.preprocessing

# previously imported data is in a 3 column df
scaler = sklearn.preprocessing.MinMaxScaler()
df_norm = df.copy()
(df_norm == df).all()
# returns True for every column

df_norm = scaler.fit_transform(df_norm)      # result is a NumPy array
df_norm = scaler.inverse_transform(df_norm)

(df_norm == df.values).all()
# returns False

So I'm puzzled: I start with two identical DataFrames, but after scaling and inverting, the datasets are no longer equal. Many of the numbers are equal, but quite a few are not. Oddly, as shown below, some values even look identical when printed but compare as unequal with df_norm == df.

df:
array([[17.21 , 17.21 , 17.23 , 17.16 ],
       [17.21 , 17.19 , 17.25 , 17.19 ],
       [17.185, 17.21 , 17.23 , 17.18 ],
       ...,
       [12.78 , 12.78 , 12.78 , 12.78 ],
       [12.78 , 12.78 , 12.78 , 12.78 ],
       [12.78 , 12.78 , 12.78 , 12.78 ]])

df_norm:
array([[17.21 , 17.21 , 17.23 , 17.16 ],
       [17.21 , 17.19 , 17.25 , 17.19 ],
       [17.185, 17.21 , 17.23 , 17.18 ],
       ...,
       [12.78 , 12.78 , 12.78 , 12.78 ],
       [12.78 , 12.78 , 12.78 , 12.78 ],
       [12.78 , 12.78 , 12.78 , 12.78 ]])

df == df_norm:
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       ...,
       [ True,  True, False,  True],
       [ True,  True, False,  True],
       [ True,  True, False,  True]])

Solution

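  • This is most likely caused by floating-point rounding. Scaling and then inverting runs every value through a subtraction, a division, a multiplication and an addition, and each result is rounded to the nearest representable float, so the round trip is not guaranteed to be bit-for-bit exact. The same effect shows up in plain Python: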

    In [17]: 0.1 + 0.2 == 0.3
    Out[17]: False
    
    In [18]: 0.1 + 0.2 - 0.3
    Out[18]: 5.551115123125783e-17
    

    Instead of ==, compare your arrays with np.allclose(), which treats values as equal when they differ by less than a small tolerance:

    np.allclose(df_norm, df.values)
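
    To see how large the differences actually are, the round trip can be reproduced on a small array. The sketch below is only an illustration: the sample values are made up to resemble the ones in the question, and the min-max scaling is written by hand in NumPy as a stand-in for MinMaxScaler, so the exact rounding may differ from what scikit-learn produces.

```python
import numpy as np

# Hypothetical sample data resembling the values in the question.
X = np.array([[17.21,  17.19, 17.25, 17.16],
              [17.185, 17.21, 17.23, 17.18],
              [12.78,  12.78, 12.78, 12.78]])

# Hand-written per-column min-max scaling (a stand-in for MinMaxScaler,
# not scikit-learn's actual implementation).
col_min = X.min(axis=0)
col_range = X.max(axis=0) - col_min
scaled = (X - col_min) / col_range

# Invert the scaling.
restored = scaled * col_range + col_min

# Exact comparison can fail because rounding may change the last bits.
exact_equal = bool((restored == X).all())
# Tolerance-based comparison ignores differences of that size.
close_enough = bool(np.allclose(restored, X))
# Largest absolute deviation introduced by the round trip.
max_abs_diff = float(np.abs(restored - X).max())

print("exactly equal:", exact_equal)
print("allclose:     ", close_enough)
print("max abs diff: ", max_abs_diff)
```

    np.allclose() uses rtol=1e-05 and atol=1e-08 by default, which is far looser than the rounding error of a scale-and-invert round trip. NumPy also prints floats with 8 digits of precision by default, which is why two arrays can look identical on screen and still compare as unequal with ==.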