python-3.x machine-learning data-science data-cleaning

How to identify the categorical variables among 200+ numerical variables?

I have a dataset with 200+ numerical variables (type: int). A few of them are actually categorical, with values like (0, 1) or (0, 1, 2, 3, 4).

I need to identify these categorical variables and dummify them. Identifying and dummifying them takes a lot of time - is there any way to do it easily?

Solution

You can treat a variable as categorical based on the number of unique values it takes. For instance, if a variable only takes the values [-2, 4, 56], you could treat it as categorical.

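import pandas as pd

# train is your DataFrame; skip the identifier and target columns
col = [c for c in train.columns if c not in ['id', 'target']]

# number of distinct values in each remaining column
numclasses = [train[c].nunique() for c in col]

# columns with fewer distinct values than the threshold are treated as categorical
threshold = 10
categorical_variables = [c for c, n in zip(col, numclasses) if n < threshold]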

Every unique value in every variable treated as categorical will create a new dummy column. If you don't want too many columns to be created later as dummies, use a smaller threshold.
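
To tie this back to the dummifying part of the question, here is a minimal sketch of the whole flow on toy data. The DataFrame, its column names (flag, level, amount) and the threshold of 6 are made up for illustration; pd.get_dummies is one common way to create the dummy columns, not necessarily the only reasonable one for your data.

```python
import pandas as pd

# Hypothetical toy data standing in for `train`; column names are made up.
train = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "flag": [0, 1, 0, 1, 1, 0],
    "level": [0, 1, 2, 3, 4, 2],
    "amount": [10, 250, 37, 4120, 88, 903],
    "target": [0, 1, 0, 1, 0, 1],
})

threshold = 6  # picked to suit the toy data, not a recommendation

# Count distinct values per feature column (id and target are excluded).
features = train.drop(columns=["id", "target"])
n_unique = features.nunique()

# Columns with fewer distinct values than the threshold are treated as categorical.
categorical_variables = n_unique[n_unique < threshold].index.tolist()

# How many dummy columns this choice of threshold will produce.
expected_new_columns = int(n_unique[categorical_variables].sum())

# Replace each selected column with one indicator column per distinct value.
dummies = pd.get_dummies(train, columns=categorical_variables)

print(categorical_variables)
print(expected_new_columns)
print(dummies.columns.tolist())
```

Checking expected_new_columns before calling get_dummies is a cheap way to see whether the threshold is too generous for a 200+ column dataset.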