Effect of scaling features when NaNs are set to -1

I have a dataset containing some features with quite a lot of NaNs (up to 80%). Removing them would skew my overall distribution, so my options are either to set all NaNs to -1/-99 or to bin each continuous variable into groups, making it a categorical feature.

As I already have many categorical features, I'd rather not make the few continuous ones categorical too. However, if I set NaNs to -1/-99, will that significantly affect the results when I scale those features?

Or, from a different perspective, is there a way of scaling features without the -1 affecting the scaling too much?


  • I know that you got the answer from the comments above, but in an effort to show new scikit-learn users how you might approach a problem like this, I've put together a very rudimentary solution that demonstrates how to build a custom transformer that would handle this:

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils.validation import check_array, check_is_fitted


    class NanImputeScaler(BaseEstimator, TransformerMixin):
        """Scale an array with missing values, then impute them
        with a dummy value. This prevents the imputed value from impacting
        the mean/standard deviation computation during scaling.

        Parameters
        ----------
        with_mean : bool, optional (default=True)
            Whether to center the variables.

        with_std : bool, optional (default=True)
            Whether to divide by the standard deviation.

        nan_level : int or float, optional (default=-99.)
            The value to impute over NaN values after scaling the other features.
        """

        def __init__(self, with_mean=True, with_std=True, nan_level=-99.):
            self.with_mean = with_mean
            self.with_std = with_std
            self.nan_level = nan_level

        def fit(self, X, y=None):
            # Check the input array, but don't force everything to be finite.
            # This also ensures the array is 2D. Note: scikit-learn 1.6 and
            # later rename force_all_finite to ensure_all_finite.
            X = check_array(X, force_all_finite=False, ensure_2d=True)

            # compute the statistics on the data irrespective of NaN values
            self.means_ = np.nanmean(X, axis=0)
            self.std_ = np.nanstd(X, axis=0)
            return self

        def transform(self, X):
            # Check that we have already fit this transformer
            check_is_fitted(self, "means_")

            # get a float copy of X so we can change it in place; check_array
            # does not copy by default, so without copy=True the caller's
            # array would be modified (and integer input could not be scaled)
            X = check_array(X, force_all_finite=False, ensure_2d=True,
                            copy=True, dtype=np.float64)

            # center if needed
            if self.with_mean:
                X -= self.means_

            # scale if needed
            if self.with_std:
                X /= self.std_

            # now fill in the missing values
            X[np.isnan(X)] = self.nan_level
            return X

    The way this works is by computing the nanmean and nanstd in the fit section so that the NaN values are ignored while computing the statistics. Then, in the transform section, after the variables are scaled and centered, the remaining NaN values are imputed with the value you designate (you mentioned -99, so that's what I defaulted to). You could always break that component of the transformer into another transformer, but I included it just for demonstration purposes.
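
    To see why the order of operations matters, here is a minimal NumPy-only sketch comparing "fill with -99, then scale" against "scale with NaN-aware statistics". The feature `x` is made up for illustration (half of its values missing); it is not taken from the question's dataset.

```python
import numpy as np

# Hypothetical feature: five observed values and five missing ones.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0] + [np.nan] * 5)
observed = ~np.isnan(x)

# Naive approach: fill first, then standardize with the ordinary mean/std.
filled = np.where(observed, x, -99.0)
naive_mean, naive_std = filled.mean(), filled.std()
naive_scaled = (filled - naive_mean) / naive_std

# NaN-aware approach: statistics come from the observed values only.
obs_mean, obs_std = np.nanmean(x), np.nanstd(x)
aware_scaled = (x - obs_mean) / obs_std

# Spread of the observed values after each kind of scaling.
naive_spread = naive_scaled[observed].max() - naive_scaled[observed].min()
aware_spread = aware_scaled[observed].max() - aware_scaled[observed].min()

print("naive mean/std:", naive_mean, naive_std)   # mean is -48.0, std is about 51.01
print("aware mean/std:", obs_mean, obs_std)       # mean is 3.0, std is about 1.41
print("observed spread (naive):", naive_spread)   # about 0.08
print("observed spread (aware):", aware_spread)   # about 2.83
```

    With the naive approach the placeholder dominates both statistics, so the real values end up squeezed into a very narrow range. Computing the statistics with nanmean/nanstd keeps the observed values on a sensible scale.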

    Example in action:

    Here we'll set up some data with NaNs present:

    nan = np.nan
    data = np.array([
        [ 1., nan,  3.],
        [ 2.,  3., nan],
        [nan,  4.,  5.],
        [ 4.,  5.,  6.]
    ])

    And when we fit the scaler and examine the means/standard deviations, you can see that the NaN values were ignored when computing them:

    >>> imputer = NanImputeScaler().fit(data)
    >>> imputer.means_
    array([ 2.33333333,  4.        ,  4.66666667])
    >>> imputer.std_
    array([ 1.24721913,  0.81649658,  1.24721913])

    Finally, when we transform the data, the data is scaled and the NaN values are handled:

    >>> imputer.transform(data)
    array([[ -1.06904497, -99.        ,  -1.33630621],
           [ -0.26726124,  -1.22474487, -99.        ],
           [-99.        ,   0.        ,   0.26726124],
           [  1.33630621,   1.22474487,   1.06904497]])
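
    If you would rather not maintain a custom class, the same output can probably be obtained by chaining two built-in components. This is a sketch that assumes a scikit-learn version (0.20 or later) in which StandardScaler disregards NaNs during fit and leaves them in place during transform, and in which SimpleImputer is available; the step names "scale" and "fill" are arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

nan = np.nan
data = np.array([
    [1., nan, 3.],
    [2., 3., nan],
    [nan, 4., 5.],
    [4., 5., 6.],
])

# Step 1 scales using statistics of the observed values only (NaNs pass through).
# Step 2 replaces the remaining NaNs with a constant placeholder.
scale_then_fill = Pipeline([
    ("scale", StandardScaler()),
    ("fill", SimpleImputer(strategy="constant", fill_value=-99.0)),
])

result = scale_then_fill.fit_transform(data)
print(result)
```

    The order of the two steps is the whole point: imputing before scaling would let the placeholder leak into the mean and standard deviation.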


    You can even use this pattern inside of a scikit-learn pipeline (and even persist it to disk):

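    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression

    # y was not defined in the original snippet; these labels are made up
    # purely so that the example runs.
    y = np.array([0, 0, 1, 1])

    pipe = Pipeline([
        ("scale", NanImputeScaler()),
        ("clf", LogisticRegression())
    ]).fit(data, y)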
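
    As for persisting to disk, here is a rough sketch of a save/load round trip using joblib. To keep the block self-contained it uses built-in StandardScaler and SimpleImputer steps as a stand-in for NanImputeScaler; the labels `y` and the file name `pipeline.joblib` are made up for illustration. One thing to keep in mind with a custom transformer: pickle-based persistence stores a reference to the class rather than its code, so the class has to be importable from the same module path when you load the file.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

nan = np.nan
data = np.array([
    [1., nan, 3.],
    [2., 3., nan],
    [nan, 4., 5.],
    [4., 5., 6.],
])
y = np.array([0, 0, 1, 1])  # made-up labels, only here so the pipeline can fit

pipe = Pipeline([
    ("scale", StandardScaler()),                                     # NaN-aware scaling
    ("fill", SimpleImputer(strategy="constant", fill_value=-99.0)),  # placeholder fill
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(data, y)

# Write the fitted pipeline to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should make the same predictions as the original.
same_predictions = np.array_equal(pipe.predict(data), restored.predict(data))
print(same_predictions)
```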