Tags: python-3.x, machine-learning, scikit-learn, classification, logistic-regression

Deriving a new continuous variable out of logistic regression coefficients


I have a set of independent variables X and a set of values of the dependent variable Y. The task at hand is binomial classification, i.e. predicting whether a debtor will default on their debt (1) or not (0). After filtering out statistically insignificant variables and variables that bring about multicollinearity, I am left with the following summary of the logistic regression model:

Accuracy: ~0.87

Confusion matrix:
[[1038  254]
 [  72 1182]]

Parameter   Coefficient
intercept        -4.210
A                 5.119
B                 0.873
C                -1.414
D                 3.757

Now I convert these coefficients into a new continuous variable, "default_probability", by applying the logistic function to the log-odds:

import math

# log-odds: intercept plus the weighted sum of the predictors,
# where A, B, C, D are a debtor's values for the four variables
power = -4.210 + (A * 5.119) + (B * 0.873) + (C * -1.414) + (D * 3.757)
# logistic transform of the log-odds
default_probability = math.exp(power) / (1 + math.exp(power))
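
For the whole portfolio the same transform can be vectorized with NumPy. Below is a minimal sketch (my own illustration; the array X_data, its column order A, B, C, D, and the toy values in it are hypothetical):

import numpy as np

# each row is one debtor; columns are the predictors A, B, C, D (toy values)
X_data = np.array([
    [0.9, 1.2, 0.3, 0.0],
    [0.1, 0.5, 2.0, 0.4],
    [0.7, 0.0, 0.1, 0.9],
])
coefs = np.array([5.119, 0.873, -1.414, 3.757])
intercept = -4.210

log_odds = intercept + X_data @ coefs               # linear predictor per debtor
default_probability = 1 / (1 + np.exp(-log_odds))   # logistic transform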

When I divide my original dataset into quartiles according to this new continuous variable "default_probability" (see the sketch after the counts below), then:

1st quartile contains 65% of defaulted debts (577 out of 884 incidents)
2nd quartile contains 23% of defaulted debts (206 out of 884 incidents)
3rd quartile contains 9% of defaulted debts (77 out of 884 incidents)
4th quartile contains 3% of defaulted debts (24 out of 884 incidents)

At the same time, the overall quantity of debtors in each quartile is:

overall quantity of debtors in 1st quartile - 1145
overall quantity of debtors in 2nd quartile - 516
overall quantity of debtors in 3rd quartile - 255
overall quantity of debtors in 4th quartile - 3043
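
For reference, here is a rough sketch of how such a per-quartile breakdown could be computed with pandas (my own illustration: the DataFrame df, its defaulted column, and the toy data are hypothetical, and pd.qcut produces equal-sized bins, which the counts above suggest was not exactly the case here):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# toy stand-in for the real portfolio: a default probability per debtor
# and whether the debtor actually defaulted
df = pd.DataFrame({'default_probability': rng.random(1000)})
df['defaulted'] = (rng.random(1000) < df['default_probability']).astype(int)

# bin debtors into quartiles; the highest-probability bin is labelled "1st"
df['quartile'] = pd.qcut(df['default_probability'], q=4,
                         labels=['4th', '3rd', '2nd', '1st'])

summary = df.groupby('quartile', observed=True).agg(
    defaulted=('defaulted', 'sum'),
    total=('defaulted', 'size'),
)
summary['default_rate'] = summary['defaulted'] / summary['total']
print(summary)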

I wanted to use "default_probability" to surgically remove the most problematic credits by imposing the business rule "no credit to the 1st quartile". But now I wonder: is this rule "surgical" at all (it would cost me 1145 - 577 = 568 "good" clients)? And, more generally, is it mathematically/logically correct to derive a new continuous variable for the dataset out of the coefficients of a logistic regression by the line of reasoning described above?


Solution

  • You have forgotten the intercept when computing power. But supposing this is only a typo, as you said in the comments, then your approach is valid. However, you might want to use scikit-learn's predict_proba method, which will save you the trouble. Example:

    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_breast_cancer
    import numpy as np

    # load a binary-classification toy dataset
    data = load_breast_cancer()
    X = data.data
    y = data.target

    # fit the logistic regression model
    # (max_iter raised so the default lbfgs solver converges on this data)
    lr = LogisticRegression(max_iter=10000)
    lr.fit(X, y)
    

    Suppose I then want to compute the probability of belonging to class 1 for a given observation (say observation i). I can do what you have done, using the regression coefficients and the intercept directly:

    i = 0
    # logistic function applied to the linear predictor for observation i
    1 / (1 + np.exp(-X[i].dot(lr.coef_[0]) - lr.intercept_[0]))
    

    Or just do:

    # probability of class 1 for observation i, computed by scikit-learn
    lr.predict_proba(X)[i][1]
    

    which is simpler and less error-prone.
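
    As a sanity check (continuing the example above), the manual formula and predict_proba agree to floating-point precision:

    # vectorized manual computation for every observation
    manual = 1 / (1 + np.exp(-X.dot(lr.coef_[0]) - lr.intercept_[0]))
    print(np.allclose(manual, lr.predict_proba(X)[:, 1]))  # True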