I'm conducting an experiment on blood test results, trying to predict the probability that a patient has a certain disease. Using the blood test results I have reached over 2000 features, and I'm trying to find a good way to eliminate features that don't help. Is there a more general way to find the unnecessary features? I'm using XGBoost and HistGradientBoosting models for the prediction.
I've tried using feature importance, but as I increase the number of patients in the dataset the important features change... I've heard about a package called SHAP, but my computer has no internet access and getting the package will take time.
You can check the correlation matrix and drop features that are highly correlated with each other, or use PCA to reduce the data to the components that capture most of its variance.
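A minimal sketch of both ideas, assuming your features sit in a pandas DataFrame `X` (the name `X` and the 0.95 thresholds are placeholders to tune, not anything from your setup):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute correlation exceeds the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

X_reduced = drop_highly_correlated(X, threshold=0.95)

# Alternatively, PCA: keep enough components to explain ~95% of the variance.
# Note the components are linear combinations of the original features, so you
# lose the direct mapping back to individual blood test values.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_reduced)
```

The correlation filter keeps the original feature names (useful if you need to report which blood tests matter), while PCA compresses harder but makes the result less interpretable.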
Regarding the issue with feature importance changing as you increase the number of patients: this is a common problem with some built-in feature importance methods. SHAP is one way to address it, as it gives a more accurate and stable estimate of feature importance by attributing each prediction across features based on Shapley values, which average a feature's contribution over possible feature combinations.
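Once you do manage to install the package, a typical SHAP workflow for a tree model looks roughly like this; `model` and `X` are placeholders for a trained XGBoost classifier and your feature DataFrame:

```python
import numpy as np
import shap

# TreeExplainer is SHAP's fast exact method for tree ensembles such as XGBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features) for binary classification

# Mean absolute SHAP value per feature gives a global importance ranking,
# which you could use to prune the least influential features.
mean_abs_shap = np.abs(shap_values).mean(axis=0)
ranking = sorted(zip(X.columns, mean_abs_shap), key=lambda t: t[1], reverse=True)
print(ranking[:20])  # top 20 features
```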
Hope that helps