python pandas dataframe scikit-learn sklearn-pandas

How to encode several columns (but not all columns) of a DataFrame in Python using pandas


I want to build a Naive Bayes model using two dataframes (a train dataframe and a test dataframe).

Each dataframe contains 13 columns, but I only want to encode 5-6 of them from str to int values. How can I encode those 6 columns in one go with a single piece of code? I followed this answer:

https://stackoverflow.com/a/37159615/12977554

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({
    'colors':  ["R" ,"G", "B" ,"B" ,"G" ,"R" ,"B" ,"G" ,"G" ,"R" ,"G" ],
    'skills':  ["Java" , "C++", "SQL", "Java", "Python", "Python", "SQL","C++", "Java", "SQL", "Java"]
    })
    
    def encode_df(dataframe):
        le = LabelEncoder()
        for column in dataframe.columns:
            dataframe[column] = le.fit_transform(dataframe[column])
        return dataframe
    
    #encode the dataframe
    encode_df(df)

But it only encodes 1 column, whereas I want to encode 6 columns with a single piece of code.


Solution

  • You can loop through the columns you want to encode and call fit_transform on each one, using a fresh LabelEncoder per column:

    cols = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
    
    for col in cols:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype('str'))
        
    df
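As a rough illustration of the loop above, here is a minimal sketch that applies it to the sample data from the question. The extra `age` column and the `encoders` dictionary are assumptions added for this example (they are not in the original post); keeping each fitted encoder makes it possible to map the integers back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample frame from the question, plus a hypothetical numeric column
# ("age") that should be left untouched by the encoding.
df = pd.DataFrame({
    "colors": ["R", "G", "B", "B", "G", "R", "B", "G", "G", "R", "G"],
    "skills": ["Java", "C++", "SQL", "Java", "Python", "Python",
               "SQL", "C++", "Java", "SQL", "Java"],
    "age": [25, 31, 22, 40, 35, 28, 33, 27, 45, 30, 29],
})

cols = ["colors", "skills"]  # only these columns get encoded
encoders = {}                # column name -> fitted LabelEncoder

for col in cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    encoders[col] = le       # keep the encoder so labels can be decoded

# Recover the original labels of one column from its integer codes
original_colors = encoders["colors"].inverse_transform(df["colors"])
```

Columns not listed in `cols` (here `age`) are never touched, so the same pattern should carry over to 6 out of 13 columns.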
    

    Ideally, you want to use the same transformer for both the train and test datasets.
    To do that, fit each encoder on the training data only and reuse it to transform both sets:

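    for col in cols:
        le = LabelEncoder()
        # Fit on the training data only, then reuse the fitted encoder
        le.fit(df_train[col].astype('str'))
        df_train[col] = le.transform(df_train[col].astype('str'))
        # Raises ValueError if the test set contains labels unseen in training
        df_test[col] = le.transform(df_test[col].astype('str'))

    df_test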
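One caveat with the train/test loop above: `LabelEncoder.transform` raises an error when the test set contains a label that never appeared in the training data. As a possible alternative (a different technique from the answer's, not a drop-in equivalent), scikit-learn's `OrdinalEncoder` can encode several feature columns at once and map unseen categories to a sentinel value. The sketch below assumes scikit-learn 0.24 or newer; the tiny dataframes, the `score` column, and the choice of `-1` as the sentinel are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test frames; "score" stands in for the columns
# that should not be encoded.
df_train = pd.DataFrame({
    "colors": ["R", "G", "B", "G"],
    "skills": ["Java", "C++", "SQL", "Python"],
    "score": [1.0, 2.0, 3.0, 4.0],
})
df_test = pd.DataFrame({
    "colors": ["G", "Y"],      # "Y" never appears in the training data
    "skills": ["SQL", "Java"],
    "score": [5.0, 6.0],
})

cols = ["colors", "skills"]

# Unseen categories are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training data only, then reuse the same encoder for the test data.
df_train[cols] = enc.fit_transform(df_train[cols].astype(str)).astype(int)
df_test[cols] = enc.transform(df_test[cols].astype(str)).astype(int)
```

Note that the `-1` sentinel would probably need separate handling before fitting a model that expects non-negative category codes, which I believe is the case for `CategoricalNB`.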