python pandas scikit-learn list-comprehension one-hot-encoding

What is the best way to create dummy variables for every DataFrame column whose dtype is object?


I have a DataFrame with many columns that need to be dummied because their dtype is object. What is the fastest and most effective way to one-hot encode these columns? A list comprehension? A lambda? Regular functions and variable assignment? I will eventually use some of the columns in a linear regression model. The data set is already very large, so ideally I'd do this without creating an excessive number of columns. Here's a failed example of the code I'm trying to make work:

[pd.get_dummies(col for col in df.columns if df.columns.dtype == 'object')]

Solution

  • You can use select_dtypes to pick out the object columns and pass only that subset to get_dummies. The result contains just the dummy columns, so you can concat it back to the rest of the original DataFrame.

    pd.get_dummies(df.select_dtypes('O'))
    

    Otherwise, pass the entire DataFrame and specify the columns to encode as a list in the columns argument. You could build that list with a list comprehension, or just check which dtypes are object.

    pd.get_dummies(df, columns=df.dtypes.loc[lambda x: x == 'O'].index.tolist())
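
Here is a minimal sketch that puts both approaches side by side. The toy frame and its column names (color, grade, price) are made up for illustration and are not from the question. The last call uses get_dummies' drop_first parameter, which the answer above doesn't mention. It drops one level per encoded column, which may help with the concern about excess columns and avoids perfectly collinear dummies in a linear regression that has an intercept.

```python
import pandas as pd

# Hypothetical toy frame: two object columns and one numeric column.
# dtype="object" is set explicitly so the example does not depend on how
# a given pandas version infers the dtype of string data.
df = pd.DataFrame({
    "color": pd.Series(["red", "blue", "red"], dtype="object"),
    "grade": pd.Series(["A", "B", "A"], dtype="object"),
    "price": [1.0, 2.5, 3.0],
})

# Names of the object-dtype columns.
obj_cols = df.select_dtypes(include="object").columns.tolist()

# Approach 1: encode only the object subset, then concat the dummies
# back onto the columns that were not encoded.
dummies = pd.get_dummies(df[obj_cols])
encoded_concat = pd.concat([df.drop(columns=obj_cols), dummies], axis=1)

# Approach 2: pass the whole frame and name the columns to encode.
encoded_direct = pd.get_dummies(df, columns=obj_cols)

# Optional: drop the first level of each encoded column, so a column
# with k distinct values yields k - 1 dummy columns instead of k.
encoded_reduced = pd.get_dummies(df, columns=obj_cols, drop_first=True)

print(encoded_direct.columns.tolist())
print(encoded_reduced.columns.tolist())
```

Both approaches should give the same frame here. The second is shorter because get_dummies keeps the untouched columns for you.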