I have a dataset with 200+ numerical variables (type: int). A few of these variables are actually categorical, with values like (0,1), (0,1,2,3,4), etc.
I need to identify these categorical variables and dummify them. Identifying and dummifying them takes a lot of time - is there any way to do it easily?
You could decide which variables to treat as categorical by the number of unique values each one has. For instance, if a variable only takes the values [-2,4,56], you could treat it as categorical.
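import pandas as pd
import numpy as np

# `train` is assumed to be your existing DataFrame; skip the id and target columns
col = [c for c in train.columns if c not in ['id', 'target']]

# count the distinct values in each remaining column
numclasses = []
for c in col:
    numclasses.append(len(np.unique(train[c])))

# columns with fewer distinct values than the threshold are treated as categorical
threshold = 10
categorical_variables = list(np.array(col)[np.array(numclasses) < threshold])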
Every unique value in every variable treated as categorical will create a new column. If you do not want too many dummy columns to be created later, use a small threshold.
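For the dummifying step itself, here is a minimal sketch of how the selected columns could be passed to `pd.get_dummies`. The toy `train` frame, its column names, and the threshold of 6 are all made up for illustration; with your real data you would likely keep a threshold closer to 10 and tune it by looking at the result.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `train` frame.
train = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "flag": [0, 1, 0, 1, 1, 0],          # 2 distinct values
    "level": [0, 1, 2, 3, 4, 2],         # 5 distinct values
    "amount": [10, 25, 31, 47, 52, 68],  # 6 distinct values, left numeric
    "target": [0, 1, 0, 1, 0, 1],
})

# Illustrative threshold, kept small only because the toy frame has 6 rows.
threshold = 6

# Count distinct values per feature column, ignoring id and target.
feature_cols = [c for c in train.columns if c not in ("id", "target")]
n_unique = train[feature_cols].nunique()

# Columns with fewer distinct values than the threshold are treated as categorical.
categorical_variables = n_unique[n_unique < threshold].index.tolist()

# One dummy column per distinct value of each selected column;
# the original categorical columns are replaced by their dummies.
train_dummies = pd.get_dummies(train, columns=categorical_variables)
print(categorical_variables)
print(train_dummies.columns.tolist())
```

Note that the number of dummy columns equals the total number of distinct values across the selected columns, so it is worth checking `train_dummies.shape` before committing to a threshold.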