I'm working on the beginner machine learning project Big Mart Sales. Its data set contains missing values (NaN) in several columns, as well as inconsistent labels that need to be normalized (lf -> Low Fat, reg -> Regular, etc.).
My current approach to preprocessing this data is to create a separate imputer for every kind of value that needs to be fixed:
import numpy as np
from sklearn.impute import SimpleImputer as Imputer
# make the values consistent
lf_imputer = Imputer(missing_values='LF', strategy='constant', fill_value='Low Fat')
lowfat_imputer = Imputer(missing_values='low fat', strategy='constant', fill_value='Low Fat')
X[:,1:2] = lf_imputer.fit_transform(X[:,1:2])
X[:,1:2] = lowfat_imputer.fit_transform(X[:,1:2])
# nan for a categorical variable
nan_imputer = Imputer(missing_values=np.nan, strategy='most_frequent')
X[:, 7:8] = nan_imputer.fit_transform(X[:, 7:8])
# nan for a numerical variable
nan_num_imputer = Imputer(missing_values=np.nan, strategy='mean')
X[:, 0:1] = nan_num_imputer.fit_transform(X[:, 0:1])
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
In addition, it is frustrating that imputer.fit_transform() requires a 2D array as an input whereas I only want to fix the values in a single column (1D). Thus, I always have to use the column that I want to fix plus a column next to it as inputs. Is there any other way to get around this? Thanks.
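On the 2D requirement: a slice such as X[:, 1:2] selects only column 1 and keeps it in 2D form with shape (n, 1), so the neighbouring column is not actually included. A minimal sketch of this, using a made-up array in place of X (the values are illustrative and not taken from the Big Mart data):

```python
import numpy as np

# Hypothetical toy array standing in for X; the values are made up.
X = np.array([[9.3, "LF", "Small"],
              [5.9, "reg", "Medium"],
              [17.5, "low fat", "High"]], dtype=object)

col_2d = X[:, 1:2]   # slice: still 2D, but contains column 1 only
col_1d = X[:, 1]     # integer index: 1D array

print(col_2d.shape)
print(col_1d.shape)

# A 1D column can be converted to the 2D form and back again.
as_2d = col_1d.reshape(-1, 1)
back_to_1d = as_2d.ravel()
```

So a 1D column can be passed to fit_transform() after reshape(-1, 1), and the result flattened again with ravel() if a 1D array is needed.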
There is a Python package, ctrl4ai, that can do this for you in a simple way:
pip install ctrl4ai
from ctrl4ai import preprocessing
preprocessing.impute_nulls(dataset)
Usage: [arg1]:[pandas dataframe],[method(default=central_tendency)]:[Choose either central_tendency or KNN]
Description: Auto identifies the type of distribution in the column and imputes null values
Note: KNN consumes more system memory if the dataset is large
Returns: Dataframe [with a separate column for each categorical value]
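For comparison, the same clean-up can also be sketched with plain pandas, without an extra package. The column names (Item_Weight, Item_Fat_Content, Outlet_Size) and the toy values below are assumptions for illustration only, so adjust them to the actual data set:

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are assumed for illustration.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 8.0],
    "Item_Fat_Content": ["LF", "reg", "low fat", "Regular"],
    "Outlet_Size": ["Small", None, "Small", "High"],
})

# Make inconsistent labels uniform with a single mapping.
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(
    {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
)

# Numerical column: fill missing values with the column mean.
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

# Categorical column: fill missing values with the most frequent value.
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])
```

Working on named columns this way also avoids the 2D input requirement, since each operation takes a single Series.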