python, machine-learning, feature-selection, one-hot-encoding, chi-squared

Feature selection after one-hot encoding


I have done one-hot encoding on my X_train dataframe in order to convert the categorical variables in the dataframe to numerical variables. This significantly increased the number of columns, since each value of a categorical column became its own column. I then ran feature selection using the filter method's univariate selection approach, picking the top 15 features that correlate most with my target variable using SelectKBest with the chi-squared score. The problem now is that the selected features have confusing names. Here is my code:

X_train = pd.get_dummies(X_train)

X_test = pd.get_dummies(X_test)


y_train = pd.get_dummies(y_train)

y_test = pd.get_dummies(y_test)


from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X_train,y_train)

dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)

#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns

print(featureScores.nlargest(15, 'Score'))  #print 15 best features


 Specs          Score
4                                weeks worked in year  131890.720755
2                     num_persons_worked_for_employer   10900.486787
1                                     instance_weight    8087.766885
67  major_occupation_code_ Executive admin and man...    7606.586291
29  education_ Prof school degree (MD DDS DVM LLB JD)    5616.479469
75      major_occupation_code_ Professional specialty    5505.713604
28  education_ Masters degree(MA MS MEng MEd MSW MBA)    5019.018784
24              education_ Bachelors degree(BA AB BS)    3692.481274
25               education_ Doctorate degree(PhD EdD)    3587.589683
96                                          sex_ Male    3424.928788
11        class_of_worker_ Self-employed-incorporated    3372.042663
55   major_industry_code_ Not in universe or children    3142.494445
71             major_occupation_code_ Not in universe    3142.494445
9                    class_of_worker_ Not in universe    3125.278635
95                                        sex_ Female    3034.914202

For example, the features 'class_of_worker_ Self-employed-incorporated' (no. 11) and 'class_of_worker_ Not in universe' (no. 9) are selected. However, these are features that were created from the 'class_of_worker' column by one-hot encoding, and that column has about nine more values, so roughly eight more features were created from 'class_of_worker' that have not been selected by the univariate selection method. Is this right? How do I select just two of the eight 'class_of_worker' features and drop the remaining six?


Solution

  • Doing this manually would make it easier to see things that could be tried. "How do I select just two of the eight 'class_of_worker' features and drop the remaining six?" Simple: just replace the variable with one that has only three values, the two you identified and "Other" for everything else. This could have been done beforehand, based on the variable values with the highest frequencies. You can also do it as you have, after running feature selection, once you see which variable values have been identified as most important. That way you capture variable values, as identified in correspondence analysis (I believe), that might be low-frequency on their own but become important in combination with a different variable's values.
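    A minimal sketch of that recoding in pandas, applied before one-hot encoding (the value strings are assumptions copied from the output above; in this dataset the raw values appear to carry a leading space):

    import pandas as pd

    # collapse 'class_of_worker' to the two identified values plus "Other",
    # on the raw (pre-get_dummies) frames
    keep = [' Self-employed-incorporated', ' Not in universe']
    for df in (X_train, X_test):
        df['class_of_worker'] = df['class_of_worker'].where(
            df['class_of_worker'].isin(keep), other='Other')

    # get_dummies now creates only three 'class_of_worker' columns
    X_train = pd.get_dummies(X_train)
    X_test = pd.get_dummies(X_test)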

    For the "confusing names" issue you can construct your own as the concatenation of variable name, decimal point and variable value. That will be as unconfusing as things can possibly get.

    This is a good question. I just think doing it with less reliance on packages would make things much clearer. You can easily construct your own chi-squared correspondence-analysis test for each pair of variable values. Then you are basically forced to think about things like "is there a target variable, or is this just exploratory?". If there is a target variable, then that variable's values would always be part of each correspondence-analysis table and would be crossed with each of the other variables' values. If there isn't a target variable and this is just exploratory, then it would be best to use the "Other" technique while keeping a larger number of variable values, e.g. the top 50, marking the lower-frequency ones as "Other". https://www.statology.org/chi-square-test-of-independence/
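    As a sketch of that do-it-yourself test, here is a chi-squared test of independence between one categorical predictor and the target, built on scipy's chi2_contingency (the column name and the un-encoded target series y_train_raw are assumptions, not taken from the code above):

    import pandas as pd
    from scipy.stats import chi2_contingency

    # contingency table crossing one predictor's values with the target's values
    table = pd.crosstab(X_train['class_of_worker'], y_train_raw)  # y_train_raw: target before get_dummies
    chi2_stat, p_value, dof, expected = chi2_contingency(table)
    print(f'chi2={chi2_stat:.1f}, p={p_value:.4g}, dof={dof}')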

    Also, you may want to give the R package "FactoMineR" a look. It only does exploratory analysis but is very useful. https://cran.r-project.org/web/packages/FactoMineR/index.html