I applied one-hot encoding to my X_train dataframe to convert its categorical variables to numerical ones. This significantly increased the number of columns, since each level of a categorical column became its own column. I then ran feature selection using the filter method's univariate selection approach and selected the top 15 features most associated with my target variable, using SelectKBest with the chi-square test. The problem now is that the selected features have confusing names. Here is my code:
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
y_train = pd.get_dummies(y_train)
y_test = pd.get_dummies(y_test)
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X_train,y_train)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] #naming the dataframe columns
print(featureScores.nlargest(15,'Score')) #print 15 best features
Specs Score
4 weeks worked in year 131890.720755
2 num_persons_worked_for_employer 10900.486787
1 instance_weight 8087.766885
67 major_occupation_code_ Executive admin and man... 7606.586291
29 education_ Prof school degree (MD DDS DVM LLB JD) 5616.479469
75 major_occupation_code_ Professional specialty 5505.713604
28 education_ Masters degree(MA MS MEng MEd MSW MBA) 5019.018784
24 education_ Bachelors degree(BA AB BS) 3692.481274
25 education_ Doctorate degree(PhD EdD) 3587.589683
96 sex_ Male 3424.928788
11 class_of_worker_ Self-employed-incorporated 3372.042663
55 major_industry_code_ Not in universe or children 3142.494445
71 major_occupation_code_ Not in universe 3142.494445
9 class_of_worker_ Not in universe 3125.278635
95 sex_ Female 3034.914202
For example, the features 'class_of_worker_ Self-employed-incorporated' (no. 11) and 'class_of_worker_ Not in universe' (no. 9) were selected. However, these columns were created from the original 'class_of_worker' column by one-hot encoding, and that column has about nine more levels, so about eight more features were created from 'class_of_worker' that were not selected by the univariate selection method. Is this right? How do I select just two of the eight 'class_of_worker' features and drop the remaining six?
Doing this manually makes it easier to see what could be tried. "How do I select just two of the eight 'class_of_worker' features and forget the remaining six?" Simple: replace the variable with one that has only three values: the two you identified and "Other" for everything else. You could have done this beforehand based on the highest-frequency values of the variable, or, as you have, after running feature selection, once you see which values were identified as most important. The latter approach also captures values that correspondence analysis (I believe) might flag as low-frequency on their own but important in combination with another variable's values.
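A minimal pandas sketch of the "Other" collapse, done before one-hot encoding (the data here is a toy stand-in for the original dataset, and the kept levels are the two from the question):

```python
import pandas as pd

# Toy data standing in for the original dataset (illustrative values only).
X = pd.DataFrame({
    "class_of_worker": [
        " Self-employed-incorporated",
        " Not in universe",
        " Private",
        " Federal government",
        " Not in universe",
    ]
})

# The two levels worth keeping; everything else collapses to "Other".
keep = {" Self-employed-incorporated", " Not in universe"}
X["class_of_worker"] = X["class_of_worker"].where(
    X["class_of_worker"].isin(keep), other="Other"
)

# One-hot encoding now yields only three dummy columns instead of nine-plus.
dummies = pd.get_dummies(X["class_of_worker"])
print(sorted(dummies.columns))
```

Running get_dummies after the collapse keeps the feature count small, so the selected dummies stay interpretable.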
For the "confusing names" issue, you can construct your own names as the concatenation of variable name, a separator such as a decimal point, and variable value. That is as unconfusing as names can possibly get.
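pandas can already produce names in that shape: get_dummies has a prefix_sep parameter that controls the separator between the column name and the level (the toy column below is illustrative):

```python
import pandas as pd

# Toy column; the dataset's actual values include a leading space.
X = pd.DataFrame({"sex": [" Male", " Female", " Male"]})

# prefix_sep sets the separator between the original column name and the level,
# e.g. a dot instead of the default underscore.
dummies = pd.get_dummies(X, prefix_sep=".")
print(list(dummies.columns))
```

You could also strip or normalize the leading spaces in the category values first, which removes the odd "class_of_worker_ Not in universe" double-separator look.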
This is a good question. I just think doing it with less reliance on packages would make things much clearer. You can easily construct your own chi-squared test of independence for each pair of variables. Then you are basically forced to think about questions like "is there a target variable, or is this just exploratory?". If there is a target variable, its values would always form one margin of each contingency table, crossed with each of the other variables' values. If there isn't a target variable and this is just exploratory, it would be best to use the "Other" technique but keep a larger number of values, e.g. the top 50, marking the lower-frequency ones as "Other". https://www.statology.org/chi-square-test-of-independence/
Also, you may want to give the R package "FactoMineR" a look. It only does exploratory analysis, but it is very useful. https://cran.r-project.org/web/packages/FactoMineR/index.html