python machine-learning data-science nan sklearn-pandas

Attribute error when handling missing categorical data

I'm trying to fill NaN categorical values using CategoricalImputer from sklearn_pandas.

from sklearn_pandas import CategoricalImputer
imputer = CategoricalImputer()

nan_columns = train_df.loc[:, train_df.isnull().any()]

for column in nan_columns:
  imputer.fit_transform(column)

But imputer.fit_transform(column) gives me this error:

AttributeError: 'str' object has no attribute 'copy'

I'm doing this following the documentation. Where am I going wrong?

Edit:

I added this cell:

from sklearn.impute import SimpleImputer

nan_columns = train_df.loc[:, train_df.isnull().any()]
imputer = SimpleImputer(strategy="most_frequent")

imputer.fit_transform(train_df)
msno.bar(train_df.sample(1000), labels=True, fontsize=8)

However, it didn't work. The bar graph still shows missing values in the columns.

Solution

You can use SimpleImputer from scikit-learn on categorical values by setting `strategy="most_frequent"`.

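import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Replace each missing value with the most frequent value in its column
imp = SimpleImputer(strategy="most_frequent")
df = pd.DataFrame({"x": ["a", "a", np.nan],
                   "y": ["c", np.nan, "c"],
                   "z": ["a", np.nan, np.nan]})
print(df)
# fit_transform returns a new array, so assign it back into the frame
df[:] = imp.fit_transform(df)
print(df)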

yields

     x    y    z
0    a    c    a
1    a  NaN  NaN
2  NaN    c  NaN

   x  y  z
0  a  c  a
1  a  c  a
2  a  c  a
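
This also suggests why the attempt in the question's edit left the missing values in place: `fit_transform` returns a new array rather than modifying `train_df`, and that cell never assigns the result. A minimal sketch of the difference, using a small made-up frame (the names `filled` and `df_filled` are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small illustrative frame; dtype=object keeps the strings as plain objects.
df = pd.DataFrame({"x": ["a", "a", np.nan],
                   "y": ["c", np.nan, "c"]}, dtype=object)
imp = SimpleImputer(strategy="most_frequent")

# fit_transform returns a new NumPy array; df itself is left unchanged.
filled = imp.fit_transform(df)
still_missing = int(df.isnull().sum().sum())
print("NaNs in df after fit_transform alone:", still_missing)

# Wrapping the result in a DataFrame (or assigning it back) keeps the fix.
df_filled = pd.DataFrame(filled, columns=df.columns, index=df.index)
now_missing = int(df_filled.isnull().sum().sum())
print("NaNs in df_filled:", now_missing)
```

Plotting `df_filled` rather than the original frame should then show no missing values.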

If you only want to use it on string or categorical columns:

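# Impute only object (string) and categorical columns, one at a time
for col, tp in df.dtypes.items():
    if tp == object or tp.name == "category":
        # df[[col]] keeps the input 2D, as the imputer expects
        df[col] = imp.fit_transform(df[[col]])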
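
As for the original `AttributeError`: iterating over a DataFrame with `for column in nan_columns` yields column labels (strings), not the column data, which would explain a `'str' object has no attribute 'copy'` error once the imputer tries to copy its input. A sketch of a loop limited to the columns that contain NaN, assuming the same most-frequent strategy; the sample frame and the names `labels` and `nan_columns` here are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative frame: two string columns with gaps and one complete numeric column.
df = pd.DataFrame({"x": pd.Series(["a", "a", np.nan], dtype=object),
                   "y": pd.Series(["c", np.nan, "c"], dtype=object),
                   "n": [1.0, 2.0, 3.0]})

# Iterating over a DataFrame yields its column labels, not the columns.
labels = [column for column in df]
print(labels)

imp = SimpleImputer(strategy="most_frequent")
# Labels of the columns that contain at least one NaN.
nan_columns = df.columns[df.isnull().any()]
for col in nan_columns:
    # df[[col]] is a one-column DataFrame (2D input for the imputer);
    # ravel() flattens the (n, 1) result back to one dimension.
    df[col] = imp.fit_transform(df[[col]]).ravel()
print(df)
```

Columns without missing values, such as `n` here, are never passed to the imputer.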