Tags: python, machine-learning, regression, data-science, classification

Recalculate the values of the binary classification probabilities based on the threshold


I have highly imbalanced data, so for binary classification I convert the class-1 probabilities to labels using threshold = 0.06.

I want to show the probabilities to management, so I need to adjust them on the condition that 0.06 is my new 50% boundary.

That is, I want my low probabilities, like 0.045, 0.067 and 0.01, to be recalculated as higher percentages.

I guess I should multiply them by something, but I don't know how to find the factor.

Data for reference:

  id     probability
_____________________
168835    0.529622
168836    0.870282
168837    0.988074
180922    0.457827
78352     0.272279
            ...   
320739    0.003046
329237    0.692332
329238    0.926343
329239    0.994264
320741    0.002714

Solution

  • Not sure if this is still useful after a year, but what you have to do is apply the inverse function to get back the x values (the log-odds), shift the curve left so that the threshold's x value lands on 0, and reapply your probability function to get the new probabilities. Multiplying won't work unless you are using a linear function, which I'm guessing is not the case.

    Assuming you use a standard logistic regression, your code for recalculating the probabilities should look something like this:

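    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({"probability_old": [0.529622, 0.870282, 0.988074, 0.457827, 0.272279,
                                           0.003046, 0.692332, 0.926343, 0.994264, 0.002714,
                                           0.06, 0.5]})
    
    def sig(z):
        """Logistic (sigmoid) function: log-odds -> probability."""
        return 1 / (1 + np.exp(-z))
    
    def inv_sig(p):
        """Logit, the inverse of sig: probability -> log-odds."""
        return np.log(p / (1 - p))
    
    y_0 = 0.06  # decision threshold that should map to 0.5
    # inv_sig(y_0) ≈ -2.75, so subtracting it raises every log-odds value by about 2.75
    df["probability_new"] = sig(inv_sig(df["probability_old"]) - inv_sig(y_0))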
    

    Results:

    id  probability_old  probability_new
     0         0.529622         0.946352
     1         0.870282         0.990576
     2         0.988074         0.999230
     3         0.457827         0.929723
     4         0.272279         0.854264
     5         0.003046         0.045680
     6         0.692332         0.972417
     7         0.926343         0.994950
     8         0.994264         0.999632
     9         0.002714         0.040892
    10         0.060000         0.500000
    11         0.500000         0.940000
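
Subtracting the two logits is the same as dividing the odds p/(1-p) by the odds of the threshold, so the recalculation can also be written in closed form. The sketch below is only an illustration of that equivalence, not part of the original answer; the function name `rescale_probability` and its `threshold` parameter are made up for this example. One practical difference is that it never takes a log, so inputs of exactly 0 or 1 pass through without producing infinities.

```python
import numpy as np

def rescale_probability(p, threshold):
    """Map probabilities so that `threshold` lands on 0.5.

    Hypothetical helper: algebraically the same as
    sig(inv_sig(p) - inv_sig(threshold)), written in terms of odds.
    """
    p = np.asarray(p, dtype=float)
    num = p * (1.0 - threshold)
    return num / (num + threshold * (1.0 - p))

# A few values taken from the table above
old = np.array([0.529622, 0.003046, 0.06, 0.5])
new = rescale_probability(old, threshold=0.06)
print(np.round(new, 6))
```

The printed values should agree with the `probability_new` column for the same inputs, up to rounding.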

    Hopefully this image will clarify the logic behind the code:

    [Image: the old probability function compared with the new one]
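
Because the shift is strictly increasing, it should not change which rows get flagged or how they rank; it only relabels the scale so that the cut-off reads as 50%. A quick sanity check along those lines, assuming the same logistic shift as in the answer (`shift_probability` is a hypothetical helper name, and the sample values are the ten probabilities listed in the question):

```python
import numpy as np

def shift_probability(p, threshold):
    """Hypothetical helper: shift log-odds so `threshold` maps to 0.5."""
    p = np.asarray(p, dtype=float)
    logit = np.log(p / (1.0 - p))                   # probability -> log-odds
    shift = np.log(threshold / (1.0 - threshold))   # log-odds of the threshold
    return 1.0 / (1.0 + np.exp(-(logit - shift)))   # log-odds -> probability

old = np.array([0.529622, 0.870282, 0.988074, 0.457827, 0.272279,
                0.003046, 0.692332, 0.926343, 0.994264, 0.002714])
new = shift_probability(old, 0.06)

# Thresholding the old values at 0.06 and the new values at 0.5
# should flag exactly the same rows.
same_labels = np.array_equal(old >= 0.06, new >= 0.5)

# The ordering of the rows should also be unchanged.
same_order = np.array_equal(np.argsort(old), np.argsort(new))

print(same_labels, same_order)
```

If either check came back false, that would suggest the shift was applied with the wrong sign or to the wrong column.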