pandas, machine-learning, scikit-learn, imputation

KNNImputer drops columns despite numeric dtypes and correct shape


I am using KNNImputer to impute np.nan values in several pd.DataFrame objects. I checked that every column in each dataframe has a numeric dtype. However, KNNImputer drops some columns in some of the dataframes:

>>> input_df.shape
(816, 216)

>>> input_df.dtypes.value_counts()
float64    216
dtype: int64

>>> output_df.shape
(816, 27)

I used the following KNNImputer configuration:

imputer = KNNImputer(n_neighbors=1, 
                     weights="uniform",
                     add_indicator=False)

output_df = imputer.fit_transform(input_df)

I would like to know why this is happening, since every one of the dataframes contains np.nan values. Note that n_neighbors=1 should not affect the outcome, since I am simply replacing each missing value with the value from the closest neighbor.


Solution

  • Most likely your data contains some columns in which every row is np.nan. By default, KNNImputer removes features that were entirely missing when fit was called, so those columns are absent from the output:

    >>> import numpy as np
    >>> import pandas as pd
    >>> from sklearn.impute import KNNImputer
    >>> 
    >>> imputer = KNNImputer(n_neighbors=1, 
    ...                      weights="uniform",
    ...                      add_indicator=False)
    >>> 
    >>> df = pd.DataFrame([[1.69, 2.69, np.nan, np.nan], [3.69, 4.69, 3.69, np.nan], [np.nan, 6.69, 5.69, np.nan], [8.69, 8.69, 7.69, np.nan]])
    >>> print(df)
          0     1     2   3
    0  1.69  2.69   NaN NaN
    1  3.69  4.69  3.69 NaN
    2   NaN  6.69  5.69 NaN
    3  8.69  8.69  7.69 NaN
    >>> print(df.shape)
    (4, 4)
    >>> print(df.dtypes.value_counts())
    float64    4
    Name: count, dtype: int64
    >>> 
    >>> output_df = imputer.fit_transform(df)
    >>> print(output_df.shape)
    (4, 3)
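
To check whether this is what happens on the real data, one option is to list the columns that are entirely NaN before imputing. The following is a minimal sketch; the small `input_df` built here is a hypothetical stand-in for the frame in the question, and `empty_mask` and `empty_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's input_df: column "c" is all NaN.
input_df = pd.DataFrame({
    "a": [1.69, 3.69, np.nan, 8.69],
    "b": [2.69, 4.69, 6.69, 8.69],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

# isna().all() is True for a column only when every row in it is missing.
empty_mask = input_df.isna().all()
empty_cols = input_df.columns[empty_mask].tolist()

print(empty_cols)
print(int(empty_mask.sum()), "of", input_df.shape[1], "columns are entirely NaN")
```

If this explanation holds for the frame in the question, the count should equal the number of dropped columns, i.e. 216 - 27 = 189.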
    

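    You can avoid this by setting the keep_empty_features parameter to True (the default is False; the parameter is available from scikit-learn 1.2). The all-NaN columns are then kept in the output and, according to the scikit-learn documentation, filled with 0: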

    >>> import numpy as np
    >>> import pandas as pd
    >>> from sklearn.impute import KNNImputer
    >>> 
    >>> imputer = KNNImputer(n_neighbors=1, 
    ...                      weights="uniform",
    ...                      keep_empty_features=True,
    ...                      add_indicator=False)
    >>> 
    >>> df = pd.DataFrame([[1.69, 2.69, np.nan, np.nan], [3.69, 4.69, 3.69, np.nan], [np.nan, 6.69, 5.69, np.nan], [8.69, 8.69, 7.69, np.nan]])
    >>> print(df)
          0     1     2   3
    0  1.69  2.69   NaN NaN
    1  3.69  4.69  3.69 NaN
    2   NaN  6.69  5.69 NaN
    3  8.69  8.69  7.69 NaN
    >>> print(df.shape)
    (4, 4)
    >>> print(df.dtypes.value_counts())
    float64    4
    Name: count, dtype: int64
    >>> 
    >>> output_df = imputer.fit_transform(df)
    >>> print(output_df.shape)
    (4, 4)