Tags: python-3.x, machine-learning, data-science, data-cleaning

How to identify the categorical variables among 200+ numerical variables?


I have a dataset with 200+ numerical variables (type: int). A few of those variables are actually categorical, with values like (0, 1), (0, 1, 2, 3, 4), etc.

I need to identify these categorical variables and dummify them. Identifying and dummifying them takes a lot of time - is there any way to do it easily?


Solution

  • You could decide which variables to treat as categorical by the number of unique values each one takes. For instance, if a variable only ever takes the values [-2, 4, 56], you could treat it as categorical.

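    import pandas as pd
    import numpy as np

    # `train` is your DataFrame; skip the identifier and target columns
    col = [c for c in train.columns if c not in ['id', 'target']]
    # count the distinct values in each remaining column
    numclasses = []
    for c in col:
        numclasses.append(len(np.unique(train[c])))

    # columns with fewer than `threshold` distinct values are treated as categorical
    threshold = 10
    categorical_variables = list(np.array(col)[np.array(numclasses) < threshold])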
    

    Every unique value of every variable treated as categorical will create a new column. If you do not want too many columns to be created later as dummies, use a small threshold.
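
The question also asks about dummifying, so here is a minimal sketch of that step, assuming the same unique-count idea as above. The toy `train` frame and its column names (`flag`, `level`, `amount`) are made up for illustration, and the threshold is set to 6 only because the toy frame has six rows. `DataFrame.nunique` and `pd.get_dummies` are standard pandas calls.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real `train` DataFrame
train = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "flag": [0, 1, 0, 1, 1, 0],          # 2 distinct values
    "level": [0, 1, 2, 3, 4, 2],         # 5 distinct values
    "amount": [10, 25, 37, 41, 58, 63],  # 6 distinct values
    "target": [0, 1, 0, 1, 0, 1],
})

threshold = 6  # assumption: tuned for this toy frame; pick your own for real data

# Count distinct values per column, skipping the identifier and target
candidates = [c for c in train.columns if c not in ("id", "target")]
n_unique = train[candidates].nunique()
categorical_variables = n_unique[n_unique < threshold].index.tolist()

# One dummy column per distinct value of each selected variable;
# the original categorical columns are dropped, all others are kept
dummified = pd.get_dummies(train, columns=categorical_variables)
print(categorical_variables)
print(dummified.columns.tolist())
```

With a low threshold, `amount` stays numeric while `flag` and `level` are expanded, which is the trade-off described above: each distinct value of a selected variable becomes one extra column.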