Tags: r, prediction, glm, categorical-data, risk-analysis

Risk assessment models in R: how to get the probability of specific levels of a factor


I work as a risk analyst, and my boss assigned me a task that I don't know how to do.

I need to get the probability of an outcome under specific conditions. For example, the data look like this:

sex      hair_color Credit_Score Loan_Status
"Male"    "Red"      "256"        "bad"        
"Female"  "black"    "133"        "bad"        
"Female"  "brown"    "33"         "bad"        
"Male"    "yellow"   "123"        "good"  

We want to predict the Loan_Status for each customer. What I can do is treat "sex", "hair_color", and "credit_score" as factors and put them into glm() in R.

But my boss wants to know: "If a new customer is male with red hair, what's the probability that his loan status will be 'good'?"

Or: "What's the probability that a male customer's loan status will be 'good'?"

What kind of method should I use? How do I get these probabilities? I'm thinking about marginal distributions, but I don't know whether that would work or how to compute them.

I hope I made this question easy to understand. To anyone who helps: thank you very much for your time.


Solution

  • I think this tutorial fits your problem perfectly: http://www.theanalysisfactor.com/r-tutorial-glm1/

    If you use it on your data, it would look something like this:

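    # The four rows from the question, recoded with short labels
    sex <- factor(c("m", "f", "f", "m"))
    hair_color <- factor(c("red", "black", "brown", "yellow"))
    credit_score <- c(256, 133, 33, 123)   # kept numeric rather than a factor
    loan_status <- factor(c("b", "b", "b", "g"))
    
    data <- data.frame(sex, hair_color, credit_score, loan_status)
    
    # With a factor response, glm() treats the first level ("b") as failure,
    # so the model describes the probability of "g" (good).
    # Note: four rows are too few to estimate all six coefficients here;
    # treat this as a template for the full data set.
    model <- glm(formula = loan_status ~ sex + hair_color + credit_score,
                 data = data,
                 family = "binomial")
    
    # type = "response" returns a probability instead of log-odds
    predict(object = model,
            newdata = data.frame(sex = "f", hair_color = "yellow", credit_score = 100),
            type = "response")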
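
To tie this back to the two questions from the boss, here is a rough sketch rather than a definitive recipe. The data set `sim` is synthetic and made up purely so the model can actually be fitted (the four rows above are not enough); the coefficients used to simulate it, the credit score of 150 for the new customer, and all object names are assumptions for illustration. For "male customers in general", one common approach is to set every customer's sex to male, predict, and average the predictions, which gives a model-based marginal probability. The raw share of good loans among males is shown next to it for comparison.

```r
# Synthetic data, invented for illustration only (not from the question)
set.seed(42)
n <- 200
sim <- data.frame(
  sex          = factor(sample(c("f", "m"), n, replace = TRUE)),
  hair_color   = factor(sample(c("black", "brown", "red", "yellow"), n, replace = TRUE)),
  credit_score = round(runif(n, 30, 300))
)
# Assumed relationship: higher credit scores make "g" (good) more likely
p_good <- plogis(-2 + 0.015 * sim$credit_score + 0.3 * (sim$sex == "m"))
sim$loan_status <- factor(ifelse(runif(n) < p_good, "g", "b"), levels = c("b", "g"))

fit <- glm(loan_status ~ sex + hair_color + credit_score,
           data = sim, family = binomial)

# Question 1: P(good) for a new male customer with red hair.
# credit_score is part of the model, so a value must be supplied;
# 150 is an arbitrary choice.
new_customer <- data.frame(
  sex          = factor("m", levels = levels(sim$sex)),
  hair_color   = factor("red", levels = levels(sim$hair_color)),
  credit_score = 150
)
p_male_red <- predict(fit, newdata = new_customer, type = "response")

# Question 2: P(good) for male customers overall.
# (a) Raw proportion of good loans among males in the data
p_male_raw <- mean(sim$loan_status[sim$sex == "m"] == "g")

# (b) Model-based marginal probability: treat everyone as male,
#     keep their other covariates, predict, and average
as_male <- sim
as_male$sex <- factor(rep("m", nrow(sim)), levels = levels(sim$sex))
p_male_model <- mean(predict(fit, newdata = as_male, type = "response"))

print(c(male_red = unname(p_male_red),
        male_raw = p_male_raw,
        male_model = p_male_model))
```

Version (a) needs no model at all, while version (b) adjusts for differences in hair color and credit score between the sexes. Which one answers the boss's question depends on whether "male customers" means the males actually in the portfolio or a like-for-like comparison.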