Tags: r, logistic-regression, glm, churn

How can I incorporate the prior weight in to my GLM function?


I am trying to incorporate the prior weights of my dependent variable into my model in R using the glm() function. The data set I am using was created to predict churn.

So far I am using the function below:

V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset, family = 
              binomial(link='logit'))

What I am looking for is how the weights argument works and how to include it in the function, or whether there is another way to incorporate this. The dependent variable is a nominal variable with the values 0 or 1. The data set is imbalanced: only 10% of the observations have a value of 1 on the dependent variable CH1 and the other 90% have a value of 0. Therefore the weights would be (0.1, 0.9).
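
To make it concrete, I imagine something along the lines of the sketch below, where the weight values (9 for the rare class, 1 for the common class) are just placeholders for illustration:

# one prior weight per observation, passed to glm() via the `weights` argument
# (the values 9 and 1 are arbitrary placeholders)
case_wts <- ifelse(trainingset$CH1 == 1, 9, 1)

V1_log <- glm(CH1 ~ RET + ORD + LVB + REV3, data = trainingset,
              weights = case_wts, family = binomial(link = 'logit'))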

My data set is built up in the following manner:

Dataset preview

The independent variables vary in data type between continuous and class (categorical) variables.


Solution

  • Although the ratio of 0s to 1s is 9:1, it does not mean the weights should be 0.1 and 0.9. The weights decide how much emphasis you want to give each observation compared to the others: an observation with prior weight w contributes to the likelihood as if it appeared w times.

    And in your case, if you want to predict something, it is essential that you split your data into train and test sets and see what influence the weights have on prediction.

    Below I use the Pima Indian diabetes example (Pima.tr from MASS) and subsample the Yes type so that the training set has a 1:9 Yes:No ratio.

    set.seed(111)
    library(MASS)
    # we sample 10 from Yes and 90 from No
    idx = unlist(mapply(sample,split(1:nrow(Pima.tr),Pima.tr$type),c(90,10)))
    Data = Pima.tr
    trn = Data[idx,]
    test = Data[-idx,]
    
     table(trn$type)
    
     No Yes 
     90  10 
    

    Let's try regressing it with weight 9 if positive and 1 if negative:

    library(caret)
    W = 9
    lvl = levels(trn$type)
    #if positive we give it the defined weight, otherwise set it to 1
    fit_wts = ifelse(trn$type==lvl[2],W,1)
    fit = glm(type ~ ., data=trn, weights=fit_wts, family=binomial)
    # we test it on the test set
    pred = ifelse(predict(fit,test,type="response")>0.5,lvl[2],lvl[1])
    pred = factor(pred,levels=lvl)
    confusionMatrix(pred,test$type,positive=lvl[2])
    
    Confusion Matrix and Statistics
    
              Reference
    Prediction No Yes
           No  34  26
           Yes  8  32
    

    You can see from above that it is doing OK, but it still misses 26 actual positives and produces 8 false positives. Let's try W = 3:

    W = 3
    lvl = levels(trn$type)
    fit_wts = ifelse(trn$type==lvl[2],W,1)
    fit = glm(type ~ ., data=trn, weights=fit_wts, family=binomial)
    pred = ifelse(predict(fit,test,type="response")>0.5,lvl[2],lvl[1])
    pred = factor(pred,levels=lvl)
    confusionMatrix(pred,test$type,positive=lvl[2])
    

    Confusion Matrix and Statistics

              Reference
    Prediction No Yes
           No  39  30
           Yes  3  28
    

    Now we manage to get almost all of the positive calls correct, but we still miss out on a lot of potential "Yes" observations. Bottom line: the code above might work, but you need to do some checks to figure out the right weight for your data.

    You can also look at the other statistics provided by confusionMatrix in caret to guide your choice.
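
    As a rough sketch of such a check, you could loop over a few candidate values of W (the values below are arbitrary) and compare the resulting test-set statistics, for example sensitivity, specificity and balanced accuracy from confusionMatrix:

    # try a few candidate weights and compare sensitivity, specificity
    # and balanced accuracy from caret's confusionMatrix on the test set
    for (W in c(1, 3, 5, 9)) {
      fit_wts = ifelse(trn$type == lvl[2], W, 1)
      fit  = glm(type ~ ., data = trn, weights = fit_wts, family = binomial)
      pred = factor(ifelse(predict(fit, test, type = "response") > 0.5, lvl[2], lvl[1]),
                    levels = lvl)
      cm   = confusionMatrix(pred, test$type, positive = lvl[2])
      cat("W =", W,
          "Sensitivity =", round(cm$byClass["Sensitivity"], 2),
          "Specificity =", round(cm$byClass["Specificity"], 2),
          "Balanced Accuracy =", round(cm$byClass["Balanced Accuracy"], 2), "\n")
    }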