python pandas scikit-learn sklearn-pandas

Label encoding several columns in a DataFrame, but only those that need it

I have a pandas DataFrame that contains floats, dates, integers, and classes. Given the sheer number of columns, what would be the most automated way to select the columns that require it (mainly the ones that are classes) and then label encode only those?

FYI: Dates must not be label encoded

Solution

Try this:

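# Select numerical and categorical columns by dtype.
# include="number" keeps datetime columns out of num_cols
# (exclude="object" would have let them in).
num_cols = X_train.select_dtypes(include="number").columns.tolist()
cat_cols = X_train.select_dtypes(include="object").columns.tolist()

# You can also pass a list of dtypes, e.g. to pick up pandas "category" columns too
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()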
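
To check what a dtype filter picks up, here is a minimal sketch on a made-up frame (the column names and values are hypothetical, not from the question). It selects numeric columns with include="number", so the datetime column ends up in neither the numeric nor the categorical list:

```python
import pandas as pd

# Hypothetical toy frame standing in for X_train.
# dtype=object is set explicitly so the string column is an object column
# regardless of the pandas version's default string dtype.
X_train = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "qty": [1, 2, 3],
    "sold_on": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "colour": pd.Series(["red", "blue", "red"], dtype=object),
    "size": pd.Categorical(["S", "M", "S"]),
})

# Numeric columns only (ints and floats)
num_cols = X_train.select_dtypes(include="number").columns.tolist()
# String-like and pandas categorical columns
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
# Datetime columns, listed separately so they can be left alone
date_cols = X_train.select_dtypes(include="datetime").columns.tolist()

print(num_cols, cat_cols, date_cols)
```

Printing the three lists before building a pipeline is a cheap way to confirm that no date column slipped into the group that gets encoded.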

After that, you can build a preprocessing pipeline like this:

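from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numerical data preprocessing pipeline
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# categorical data preprocessing pipeline
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    # sparse_output replaced the sparse argument in scikit-learn 1.2;
    # on older versions use sparse=False instead
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)

# full pipeline; columns in neither list are dropped by default
# (pass remainder="passthrough" to keep them)
full_pipe = ColumnTransformer(
    [("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)]
)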
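
Note that this pipeline one-hot encodes the categorical columns rather than label encoding them. If integer codes are what you need, one option is scikit-learn's OrdinalEncoder, which handles several feature columns at once (LabelEncoder is intended for a single target column). A minimal sketch, assuming a made-up frame with hypothetical column names; the date column is never passed to the encoder:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame; names and values are for illustration only
df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "sold_on": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "colour": pd.Series(["red", "blue", "red"], dtype=object),
    "size": pd.Series(["S", "M", "S"], dtype=object),
})

# Only object/category columns are selected, so dates and numbers are skipped
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

# Categories unseen during fit are mapped to -1 at transform time
# instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Work on a copy and overwrite just the categorical columns with their codes
encoded = df.copy()
encoded[cat_cols] = encoder.fit_transform(df[cat_cols])

print(encoded)
```

Keeping the fitted encoder around lets you call encoder.transform on new data with the same mapping. The codes follow the sorted order of the categories, so they carry no meaningful ranking unless you pass an explicit categories argument.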