python, pandas, scikit-learn, sklearn-pandas

Label encoding several columns in a DataFrame, but only those that need it


I have a pandas DataFrame that contains floats, dates, integers, and classes. Given the sheer number of columns, what would be the most automated way to select the columns that need it (mainly the class columns) and then label encode them?

FYI: Dates must not be label encoded


Solution

  • Try this -

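    # Select numerical and categorical columns by dtype.
    # include="number" keeps datetime columns out of num_cols,
    # so dates are neither scaled nor encoded.
    num_cols = X_train.select_dtypes(include="number").columns.tolist()
    cat_cols = X_train.select_dtypes(include="object").columns.tolist()
    
    # You can also pass a list, e.g. to pick up pandas "category" columns too
    cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()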
    

    After that you can make a pipeline like this -

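    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    
    # numerical data preprocessing pipeline
    num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    
    # categorical data preprocessing pipeline
    # (sparse_output replaces the old `sparse` argument, renamed in scikit-learn 1.2)
    cat_pipe = make_pipeline(
        SimpleImputer(strategy="constant", fill_value="NA"),
        OneHotEncoder(handle_unknown="ignore", sparse_output=False),
    )
    
    # full pipeline; columns listed in neither num_cols nor cat_cols are dropped
    # by default, pass remainder="passthrough" to keep them unchanged
    full_pipe = ColumnTransformer(
        [("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)]
    )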
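
The pipeline above one-hot encodes the class columns. If plain integer labels are what you are after, as the question asks, the same dtype selection can drive a pandas-only encoding. The following is a minimal sketch rather than a definitive implementation: the toy DataFrame and its column names (`price`, `qty`, `sold_on`, `color`, `size`) are made up for illustration, and it uses pandas category codes in place of scikit-learn's `LabelEncoder`.

```python
import pandas as pd

# Hypothetical toy data: one float, one integer, one date and two class columns.
df = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],
        "qty": [1, 2, 3],
        "sold_on": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
        "color": pd.Series(["red", "blue", "red"], dtype=object),
        "size": pd.Categorical(["S", "M", "S"]),
    }
)

# Only object/category columns are selected; datetime64 columns are not matched.
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

encoded = df.copy()
for col in cat_cols:
    # Category codes are integers 0..n-1 (missing values become -1).
    encoded[col] = encoded[col].astype("category").cat.codes

print(cat_cols)
print(encoded.dtypes)
```

The date column `sold_on` keeps its datetime dtype because it never enters `cat_cols`. Note that the codes follow the sorted order of the categories, so they carry no meaning beyond being distinct labels.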