I have a pandas DataFrame that contains floats, dates, integers, and classes. Given the sheer number of columns, what would be the most automated way to select the columns that require it (mainly the ones that are classes) and then label encode them?
FYI: Dates must not be label encoded
Try this -
# To select numerical and categorical columns
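# include="number" keeps ints and floats only; exclude="object" would also pick up datetime columns
num_cols = X_train.select_dtypes(include="number").columns.tolist()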
cat_cols = X_train.select_dtypes(include="object").columns.tolist()
# you can also pass a list like -
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
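As a quick check that dates stay out of both lists, here is a minimal sketch on a made-up frame. The column names and values are hypothetical, and it assumes the dates are already parsed to datetime64. Numeric columns are selected with include="number" so that datetime columns land in neither list:

```python
import pandas as pd

# Hypothetical toy frame standing in for X_train
X_train = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],
        "qty": [1, 2, 3],
        "sold_on": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
        "colour": pd.Series(["red", "blue", "red"], dtype="object"),
        "size": pd.Categorical(["S", "M", "S"]),
    }
)

# ints and floats only
num_cols = X_train.select_dtypes(include="number").columns.tolist()
# string-like and categorical columns
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
# datetime columns, listed separately so they can be left untouched
date_cols = X_train.select_dtypes(include="datetime").columns.tolist()

print(num_cols, cat_cols, date_cols)
```

If the dates are still stored as strings, they will show up under "object" and need to be converted (for example with pd.to_datetime) before this selection.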
After that you can make a pipeline like this -
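from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numerical data preprocessing pipeline
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# categorical data preprocessing pipeline
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    # scikit-learn >= 1.2 names this parameter sparse_output; older versions use sparse=False
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)
# full pipeline; columns in neither list are dropped by default (remainder="drop")
full_pipe = ColumnTransformer(
    [("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)]
)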
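Note that the pipeline above one-hot encodes, while the question asked for label encoding. If integer codes are what you want, one option is scikit-learn's OrdinalEncoder, which is the feature-column counterpart of LabelEncoder (LabelEncoder itself is intended for the target). This is only a sketch on a made-up frame: the column names are hypothetical, and using -1 for unseen categories is an assumption you may want to change.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame standing in for X_train
X_train = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],
        "sold_on": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
        "colour": pd.Series(["red", "blue", "red"], dtype="object"),
    }
)

cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()

# Categories unseen at fit time are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Encode only the categorical columns; floats and dates are left as they are
X_encoded = X_train.copy()
X_encoded[cat_cols] = encoder.fit_transform(X_train[cat_cols])

print(X_encoded)
```

The codes follow the sorted order of the categories in each column, so they carry no meaning beyond identity. Tree-based models usually cope with that, whereas linear models tend to do better with the one-hot version.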