I have successfully run a machine learning algorithm using xgboost on Python 3.8.5, but am struggling with interpretation of the results.
The output/target is binary: deceased or not deceased.
Both my audience and I are comfortable with odds ratios like those that come from R's glm, and I'm sure that xgboost can display this information somehow, but I can't figure out how.
My first instinct is to look at the output from xgboost's predict_proba, but when I do that, I get
>>> deceased.pp.view()
array([[0.5828363 , 0.4171637 ],
[0.89795643, 0.10204358],
[0.5828363 , 0.4171637 ],
[0.89795643, 0.10204358]], dtype=float32)
I'm assuming that these are the p that would go into the formula p/(1-p) to calculate an odds ratio for each input term, like sex and age.
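For reference, the odds for a single prediction are p / (1 - p), where p is the predicted probability of the event. Here is a minimal sketch using the probabilities printed above, assuming the second column is P(deceased). Note that this gives odds per row (per observation), not an odds ratio per input variable:

```python
# Predicted probabilities copied from the output above.
# Assumption: column 0 is P(not deceased), column 1 is P(deceased).
proba = [
    [0.5828363, 0.4171637],
    [0.89795643, 0.10204358],
]


def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)


# Odds of the positive class for each row (one value per observation).
row_odds = [odds(row[1]) for row in proba]
for row, o in zip(proba, row_odds):
    print(f"P(deceased) = {row[1]:.4f}  ->  odds = {o:.4f}")
```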
I found a similar question on this website but the answer doesn't help me:
xgboost predict_proba : How to do the mapping between the probabilities and the labels
So, based on the answer there, I use .classes_ to get this
>>> deceased.xg_clf.classes_
array([False, True])
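As far as I understand the scikit-learn convention that the linked answer relies on, column i of predict_proba corresponds to classes_[i], so classes_ only identifies which column is P(deceased). A small sketch, with plain lists standing in for the two arrays shown above:

```python
# Stand-ins for deceased.xg_clf.classes_ and the predict_proba output above.
classes = [False, True]
proba = [
    [0.5828363, 0.4171637],
    [0.89795643, 0.10204358],
]

# Column i of predict_proba holds the probability of classes[i],
# so look up the position of the "deceased" (True) label.
deceased_col = list(classes).index(True)

# Probability of "deceased" for each input row.
p_deceased = [row[deceased_col] for row in proba]
print(deceased_col, p_deceased)
```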
But .classes_ tells me nothing about how to find out which input categories, e.g. age or sex, have what probabilities.
How can I link classes_ with the input categories?
Or, if that is not correct or impossible, how else can I calculate odds ratios for each input variable in xgboost?
In fact, I'm not even sure that xgboost can give glm-like odds ratios; the closest thing seems to be feature_importances_.
However, feature importance doesn't give the same information that odds ratios do.
Agreed that it doesn't really fit for XGBoost to provide something like an odds ratio. Have you taken a look at other forms of model interpretability for more complex models like XGBoost? shap, for example, is a library that can provide similar sorts of analysis but is better suited to these types of models.
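Separately from shap, one model-agnostic way to get a number on the odds scale is to compare the predicted odds for the same row with a single input changed. This is only a sketch under assumptions: predict_deceased below is a hypothetical stand-in for something like the second column of predict_proba, and its probabilities are made up for illustration.

```python
def predict_deceased(row):
    """Hypothetical stand-in for a fitted model's P(deceased).

    The probabilities below are invented for illustration only.
    """
    table = {
        ("male", "old"): 0.40,
        ("female", "old"): 0.25,
        ("male", "young"): 0.10,
        ("female", "young"): 0.08,
    }
    return table[(row["sex"], row["age_group"])]


def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)


def odds_ratio(row, feature, value_a, value_b, predict=predict_deceased):
    """Predicted odds with feature=value_a divided by the odds with
    feature=value_b, holding every other input in `row` fixed."""
    row_a = {**row, feature: value_a}
    row_b = {**row, feature: value_b}
    return odds(predict(row_a)) / odds(predict(row_b))


# Same contrast (male vs female) evaluated for two different rows.
or_old = odds_ratio({"sex": "male", "age_group": "old"}, "sex", "male", "female")
or_young = odds_ratio({"sex": "male", "age_group": "young"}, "sex", "male", "female")
print(f"male vs female, old:   {or_old:.2f}")
print(f"male vs female, young: {or_young:.2f}")
```

Unlike a glm coefficient, the ratio from a tree model generally differs from row to row (as the two contrasts here do), so any single summary, such as an average over the data, is not the same quantity as a glm odds ratio.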