Search code examples
pythonpandasone-hot-encoding

Convert Categorical into Numerical Python Panda Columns must be same length as key


I'm a novice at Python and start to learn Data Science using Python. Please help with my problem.

I'm following this youtube video tutorial: https://www.youtube.com/watch?v=DPeTCC9Qm80 to convert categorical data into numerical data (one hot encoding).

The problem is when I want to replace the column (e.g Job) with the encoding it was an error:

ValueError: Columns must be same length as key

I using the data from kaggle: https://www.kaggle.com/datasets/krantiswalke/bankfullcsv?resource=download

here is the code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
path="bank-full.csv"
df = pd.read_csv(path)

df_cat=df.select_dtypes(object)
df_num=df.select_dtypes(np.number)

df_cat['job']=pd.get_dummies(df_cat['job'], drop_first=True)

I know what cause the error because 'job' column will be replace with 11 columns result from one hot encoding (blue-collar, entrepreneur, housmaid, etc..)

how to solve this error?

I also try to encoding all the categorical column with this code:

cols=list(df_cat.columns) def cat_2_num(cols): for col in cols: df_cat[col]=pd.get_dummies(df_cat[col], dro_first=True) cat_2_num(cols)

but the dtypes still object not converted to uint8 (number)

I try manually code like this for 'job' column:

df_job = pd.get_dummies(df_cat['job'], drop_first=True)
df_new = pd.concat([df,df_job], axis=1)
df_new = df_new.drop('job', axis=1)

I drop 'job' column and append 'job' hot encoding result. How to do it efficiently especially for all categorical columns?


Solution

  • You do not need to specify one column at a time to do this. pd.get_dummies will automatically drop your string columns and replace them with one-hot encoded columns if you give it your entire dataframe.

    df     = pd.read_csv("/path/bank-full.csv")
    df_new = pd.get_dummies(df, drop_first=True)
    

    Output:

           age  balance  day  ...  poutcome_success  poutcome_unknown  Target_yes
    0       58     2143    5  ...                 0                 1           0
    1       44       29    5  ...                 0                 1           0
    2       33        2    5  ...                 0                 1           0
    3       47     1506    5  ...                 0                 1           0
    4       33        1    5  ...                 0                 1           0
       ...      ...  ...  ...               ...               ...         ...
    45206   51      825   17  ...                 0                 1           1
    45207   71     1729   17  ...                 0                 1           1
    45208   72     5715   17  ...                 1                 0           1
    45209   57      668   17  ...                 0                 1           0
    45210   37     2971   17  ...                 0                 0           0