python-3.x, machine-learning, scikit-learn, spyder, imputation

How to preprocess a dataset with many types of missing data


I'm working on the beginner machine learning project Big Mart Sales. The data set for this project contains many types of missing values (NaN), as well as values that need to be changed (lf -> Low Fat, reg -> Regular, etc.).

My current approach to preprocessing this data is to create an imputer for every type of value that needs to be fixed:

import numpy as np
from sklearn.impute import SimpleImputer as Imputer

# make the values consistent
lf_imputer = Imputer(missing_values='LF', strategy='constant', fill_value='Low Fat')
lowfat_imputer = Imputer(missing_values='low fat', strategy='constant', fill_value='Low Fat')
X[:,1:2] = lf_imputer.fit_transform(X[:,1:2])
X[:,1:2] = lowfat_imputer.fit_transform(X[:,1:2])

# nan for a categorical variable
nan_imputer = Imputer(missing_values=np.nan, strategy='most_frequent')
X[:, 7:8] = nan_imputer.fit_transform(X[:, 7:8])

# nan for a numerical variable
nan_num_imputer = Imputer(missing_values=np.nan, strategy='mean')
X[:, 0:1] = nan_num_imputer.fit_transform(X[:, 0:1])

However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?

In addition, it is frustrating that imputer.fit_transform() requires a 2D array as an input whereas I only want to fix the values in a single column (1D). Thus, I always have to use the column that I want to fix plus a column next to it as inputs. Is there any other way to get around this? Thanks.

Here are some rows of my data: [image not available]


Solution

  • There is a Python package, ctrl4ai, that can do this for you in a simple way:

    pip install ctrl4ai
    
    from ctrl4ai import preprocessing
    
    preprocessing.impute_nulls(dataset)
    
    
    Usage: [arg1]:[pandas dataframe],[method(default=central_tendency)]:[Choose either central_tendency or KNN]
    Description: Auto identifies the type of distribution in the column and imputes null values
    Note: KNN consumes more system memory if the dataset is large
    Returns: Dataframe [with a separate column for each categorical value]
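
If you would rather not add a dependency, plain pandas can cover the same steps as the question's imputers. Here is a minimal sketch, not a definitive implementation. The small DataFrame and the column names (Item_Weight, Item_Fat_Content, Outlet_Size) are assumptions for illustration, so substitute the columns from your own data. Each operation works on a single column (a 1D Series), so no 2D slicing is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the column names are assumed, not taken from the question.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2],
    "Item_Fat_Content": ["Low Fat", "LF", "reg", "low fat"],
    "Outlet_Size": ["Medium", np.nan, "Medium", "High"],
})

# Make the labels consistent with one mapping instead of one imputer per variant.
fat_labels = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(fat_labels)

# NaN in a categorical column: fill with the most frequent value.
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

# NaN in a numerical column: fill with the column mean.
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

print(df)
```

Note that computing the mean or mode on the full data set before a train/test split leaks information from the test rows. If that matters for your project, compute the fill values on the training rows only and reuse them for the test rows.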