I have a population of 2 million people and 700 variables (many of which contain nulls, zeros, or -9999 values), for which I am developing a Python model. It works as follows:
I generate a DataFrame of the entire population and all variables.
I drop the variables I don't need (ID and name, for example).
I partition the data and calculate an indicator for each variable with ks_2samp (scipy.stats),
and then filter out the variables with very low scores (ROC, KS).
Then I compute the correlations between all the variables and filter again.
Right after this, I train the model with XGBoost. Using shap.TreeExplainer, I get the importance of all the variables that remain in my model (about 90 variables). A rough sketch of these steps is below.
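A minimal sketch of that pipeline, for reference. The file path, column names ("target", "ID", "name"), and the filtering thresholds are all assumptions, not values from my actual data:

```python
import numpy as np
import pandas as pd
import xgboost as xgb
import shap
from scipy.stats import ks_2samp

df = pd.read_csv("population.csv")        # hypothetical path
df = df.drop(columns=["ID", "name"])      # drop identifier columns
y = df.pop("target")                      # binary target (assumed name)

# 1) Univariate KS filter: compare each feature's distribution between classes
ks_scores = {
    col: ks_2samp(df.loc[y == 0, col].dropna(),
                  df.loc[y == 1, col].dropna()).statistic
    for col in df.columns
}
keep = [c for c, s in ks_scores.items() if s > 0.05]   # illustrative threshold
df = df[keep]

# 2) Correlation filter: drop one feature from each highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
df = df.drop(columns=to_drop)

# 3) Fit XGBoost and rank the remaining features by mean |SHAP| value
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(df, y)
shap_values = shap.TreeExplainer(model).shap_values(df)
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=df.columns)
print(importance.sort_values(ascending=False).head(20))
```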
Although I have reduced the number of variables, there are still too many. Does anyone know a way to keep removing variables? My goal is to get down to about 30.
You can write a loop that searches for the model that keeps accuracy high while using as few features as possible.
Pseudo code (a Python sketch follows the list):

1. Train a model with the current n features.
2. Measure the model's objective (accuracy, for example).
3. Save the accuracy and the features used.
4. If only 30 features remain, go to step 8.
5. Get the feature importances.
6. Drop the feature with the lowest importance.
7. Go to step 1.
8. Review the saved accuracies and feature sets, and pick the trade-off you want: higher accuracy with more features, or somewhat lower accuracy with fewer features.
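Here is a minimal sketch of that backward-elimination loop. `X`, `y`, the train/validation split, the AUC metric, and the XGBoost parameters are assumptions; adapt them to your setup:

```python
# Backward feature elimination following the pseudocode above.
# X (a DataFrame) and y are assumed to exist already; parameters are illustrative.
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

features = list(X_train.columns)
history = []                      # (n_features, score, feature list) per iteration

while True:
    model = xgb.XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    model.fit(X_train[features], y_train)

    score = roc_auc_score(y_val, model.predict_proba(X_val[features])[:, 1])
    history.append((len(features), score, list(features)))

    if len(features) <= 30:
        break

    # Drop the feature with the lowest importance and retrain
    importances = dict(zip(features, model.feature_importances_))
    weakest = min(importances, key=importances.get)
    features.remove(weakest)

# Inspect the trade-off between number of features and validation AUC
for n, score, _ in history:
    print(f"{n} features -> AUC {score:.4f}")
```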
You can also use Optuna or another hyperparameter tuner. If you include the feature subset in the search space, it will try to find the best model for your objective (accuracy or whatever you choose) and, at the same time, identify which features to use and how many.
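A sketch of that idea with Optuna, where each candidate feature gets a boolean flag in the search space (this is not a built-in Optuna feature-selection API, just one way to set up the objective; `X`, `y`, the split, and the penalty term are assumptions, and with ~90 features the search space is large, so expect to need many trials):

```python
import optuna
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

def objective(trial):
    # One boolean flag per candidate feature
    selected = [c for c in X_train.columns
                if trial.suggest_categorical(f"use_{c}", [True, False])]
    if not selected:
        return 0.0

    model = xgb.XGBClassifier(
        n_estimators=trial.suggest_int("n_estimators", 100, 500),
        max_depth=trial.suggest_int("max_depth", 3, 8),
        eval_metric="logloss",
    )
    model.fit(X_train[selected], y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val[selected])[:, 1])

    # Penalize feature sets larger than ~30 to push the search toward fewer features
    return auc - 0.001 * max(0, len(selected) - 30)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_trial.params)
```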