python pandas machine-learning data-analysis preprocessor

Preprocessing Dataset with Large Categorical Variables

I have looked for answers to this question, but none of the ones on Stack Overflow seems to fit well.

I have a dataset with 40 columns and 55,000 rows. Only 8 out of these columns are numerical. The remaining 32 are categorical with string values in each.

Now I want to do exploratory data analysis for a predictive model, and I need to drop irrelevant columns that do not show high correlation with the target (the variable to predict). But since all 32 of these variables are categorical, how can I measure their relevance to the target variable?

What I am thinking of trying:

1. Label encoding all 32 columns, then running dimensionality reduction via PCA, and then building a predictive model. (If I do this, how can I clean my data by removing the irrelevant columns that have low corr() with the target?)
2. One-hot encoding all 32 columns and running a predictive model on them directly. (If I do this, the data-cleaning step is lost entirely: the number of columns will skyrocket, and the model will consider all variables, relevant and irrelevant, for its prediction!)
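
The column growth mentioned in the second option can be estimated before encoding anything. A minimal sketch, assuming a toy DataFrame whose column names are made up for illustration (the real data would have 32 such columns):

```python
import pandas as pd

# Toy stand-in for the real dataset; "city", "plan" and "age" are hypothetical names.
df = pd.DataFrame({
    "city": ["NY", "LA", "SF", "NY", "LA", "SF"],
    "plan": ["free", "pro", "free", "pro", "free", "pro"],
    "age": [31, 45, 27, 38, 52, 29],
})

# Treat every non-numeric column as categorical.
cat_cols = df.select_dtypes(exclude="number").columns

# Number of distinct values per categorical column.
cardinality = df[cat_cols].nunique()

# One-hot encoding creates one column per distinct value.
one_hot_width = int(cardinality.sum())

# Actual encoding, to compare against the estimate.
encoded = pd.get_dummies(df, columns=list(cat_cols))

print(cardinality.sort_values(ascending=False))
print("one-hot columns:", one_hot_width, "| total after encoding:", encoded.shape[1])
```

Columns with very high cardinality are the ones that make the encoded frame blow up, so sorting `cardinality` shows where most of the growth would come from.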

What is the best practice in such a situation for building a predictive model when you have many categorical columns?

Solution

You need to check the correlation between each column and the target. There are two scenarios I can think of:

1. If the target variable is continuous and the independent variable is categorical, you can use Kendall's tau correlation. (Kendall's tau is a rank correlation, so this assumes the categories have a meaningful order.)
2. If both the target and the independent variable are categorical, you can use Cramér's V.
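
Both measures can also be computed by hand with pandas and SciPy. The following is only a sketch under stated assumptions: the toy data, the column names, the category order passed to `kendall_tau`, and the 0.3 cut-off are all made up for illustration, and the helper functions are not part of any library.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, kendalltau


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    k = min(table.shape) - 1
    if k == 0:
        # One of the variables has a single level, so there is nothing to associate.
        return 0.0
    # Plain chi-square statistic, without the Yates continuity correction.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (n * k)))


def kendall_tau(cat: pd.Series, target: pd.Series, order: list) -> float:
    """Kendall's tau between an ordered categorical and a numeric target."""
    # Map each category to its position in the assumed order.
    ranks = cat.map({level: i for i, level in enumerate(order)})
    tau, _ = kendalltau(ranks, target)
    return float(tau)


# Hypothetical data: two categorical features, one categorical and one numeric target.
df = pd.DataFrame({
    "size": ["S", "M", "L", "S", "M", "L", "S", "L"],
    "color": ["red", "blue", "red", "blue", "red", "blue", "red", "blue"],
    "bought": ["no", "yes", "yes", "no", "yes", "yes", "no", "yes"],
    "spend": [5.0, 12.0, 20.0, 6.0, 11.0, 25.0, 4.0, 22.0],
})

# Categorical target: score each categorical column with Cramér's V.
scores = {col: cramers_v(df[col], df["bought"]) for col in ["size", "color"]}

# Keep columns above an arbitrary, illustrative threshold.
keep = [col for col, v in scores.items() if v >= 0.3]

# Continuous target: Kendall's tau, assuming the order S < M < L.
tau = kendall_tau(df["size"], df["spend"], order=["S", "M", "L"])

print(scores, keep, round(tau, 3))
```

The threshold is a judgment call and should be chosen by looking at the distribution of scores in the real data rather than taken from this sketch.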

There's a Python package that can do all of this for you and lets you keep only the columns you need:

pip install ctrl4ai

from ctrl4ai import automl

# dataframe and learning_type are placeholders for your own values
automl.preprocess(dataframe, learning_type)

Use help(automl.preprocess) to learn more about its parameters; you can customise the preprocessing the way you want.

Also check automl.master_correlation, which computes correlations using the approach explained above.