Tags: python, machine-learning, data-science, nan, sklearn-pandas

Attribute error when handling missing categorical data


I'm trying to fill NaN categorical values using CategoricalImputer from sklearn_pandas.

from sklearn_pandas import CategoricalImputer
imputer = CategoricalImputer()

nan_columns = train_df.loc[:, train_df.isnull().any()]

for column in nan_columns:
  imputer.fit_transform(column)

But imputer.fit_transform(column) gives me this error:

AttributeError: 'str' object has no attribute 'copy'

I'm doing this following the documentation. Where am I going wrong?

Edit:

I added this cell:

from sklearn.impute import SimpleImputer

nan_columns = train_df.loc[:, train_df.isnull().any()]
imputer = SimpleImputer(strategy="most_frequent")

imputer.fit_transform(train_df)
msno.bar(train_df.sample(1000), labels=True, fontsize=8)

However, it didn't work. This is the bar graph showing that there are still missing values in the columns:

[Image: bar graph of missing values per column]


Solution

  • You can use SimpleImputer from scikit-learn with categorical values by setting strategy="most_frequent".

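    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    imp = SimpleImputer(strategy="most_frequent")
    df = pd.DataFrame({"x": ["a", "a", np.nan],
                       "y": ["c", np.nan, "c"],
                       "z": ["a", np.nan, np.nan]})
    print(df)
    # fit_transform returns a new array, so assign it back into the frame
    df[:] = imp.fit_transform(df)
    print(df)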
    

    yields

         x    y    z
    0    a    c    a
    1    a  NaN  NaN
    2  NaN    c  NaN
    
       x  y  z
    0  a  c  a
    1  a  c  a
    2  a  c  a
    

    If you only want to use it on string or categorical columns:

    for col, tp in df.dtypes.items():
        if tp == object or tp.name == "category":
            # df[[col]] keeps the input 2-D; ravel() flattens the (n, 1) result
            df[col] = imp.fit_transform(df[[col]]).ravel()
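
As for why the two attempts in the question failed, here is a minimal sketch on a small made-up frame (not the asker's train_df). Iterating over a DataFrame yields its column labels, so the loop hands the imputer a plain string, which is presumably where the "'str' object has no attribute 'copy'" error comes from. In the second attempt, fit_transform returns a new NumPy array rather than modifying the frame, so the result has to be assigned back. The names df, labels, filled and df_filled are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small illustrative frame; dtype=object keeps the columns as plain Python strings/NaN.
df = pd.DataFrame({"x": ["a", "a", np.nan],
                   "y": ["c", np.nan, "c"]}, dtype=object)

# 1. Iterating over a DataFrame yields column labels (strings), not the columns.
labels = [column for column in df]
print(labels)

# 2. fit_transform returns a new NumPy array; df itself is left untouched.
imp = SimpleImputer(strategy="most_frequent")
filled = imp.fit_transform(df)
still_missing = int(df.isnull().sum().sum())
print(still_missing)

# Wrap the result in a DataFrame (or assign it back) to keep the imputed values.
df_filled = pd.DataFrame(filled, columns=df.columns, index=df.index)
print(df_filled)
```

To loop over the columns themselves rather than their labels, df.items() yields (label, Series) pairs.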