
Offsetting Oversampling in SAS for rare events in Logistic Regression


Can anyone help me understand the pre-model and post-model adjustments for oversampling using the offset method (preferably in Base SAS with PROC LOGISTIC and scoring) in logistic regression?

I will take an example. Consider a traditional credit scoring model for a bank: let's say we have 50,000 good customers and 2,000 bad customers. For my logistic regression I am using all 2,000 bad customers and a random sample of 2,000 good customers. How can I adjust for this oversampling in PROC LOGISTIC using options like OFFSET, and also during scoring? Do you have any references with illustrations on this topic? Thanks in advance for your help!


Solution

  • OK, here are my 2 cents.

    Sometimes the target variable is a rare event, like fraud. In this case a logistic regression fit will suffer from significant sample bias because there is not enough event data. Oversampling is a common method because of its simplicity.

    However, model calibration is required when the scores are used for decisions (this is your case); nothing needs to be done if the model is only used for rank ordering (bear in mind the probabilities will be inflated, but the ordering stays the same).

    Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of sampling (or oversampling), so no weighting is needed. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect.

    Suppose the true model is ln(y/(1-y)) = b0 + b1*x. When the data are oversampled, the slope estimate b1' is still consistent with the true b1; however, the intercept estimate b0' is not equal to b0.
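
    To see why, write s1 and s0 for the fractions of events and non-events that end up in the sample (these symbols are my own shorthand). If the true model is ln(y/(1-y)) = b0 + b1*x, then in the oversampled data

        ln(y*/(1-y*)) = (b0 + ln(s1/s0)) + b1*x

    where y* is the event probability in the sample, so the slope is untouched while the intercept is shifted by ln(s1/s0).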

    There are generally two ways to correct for this:

    1. weighted logistic regression (a brief sketch appears further down), or
    2. simply adding an offset.

    I am going to explain the offset version only as per your question.

    Let's create some dummy data where the true relationship between your dependent variable (y) and your independent variable (iv) is ln(y/(1-y)) = -6 + 2*iv:

    data dummy_data;
        do j=1 to 1000;
            iv=rannor(10000);            * independent variable;
            p=1/(1+exp(-(-6+2*iv)));     * true event probability;
            y=ranbin(10000,1,p);         * dependent variable (1/0);
            output;
        end;
        drop j;
    run;
    

    and let’s see your event rate:

    proc freq data=dummy_data;
    tables y;
    run;
    
    
                                  Cumulative    Cumulative
    y    Frequency     Percent     Frequency      Percent
    ------------------------------------------------------
    0         979       97.90           979        97.90  
    1          21        2.10          1000       100.00  
    

    Similar to your problem, the event rate is p = 0.0210, in other words very rare.

    Let's use proc logistic to estimate the parameters:

    proc logistic data=dummy_data;
    model y(event="1")=iv;
    run;
    
                                   Standard          Wald
    Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
    
    Intercept     1     -5.4337      0.4874      124.3027        <.0001
    iv            1      1.8356      0.2776       43.7116        <.0001
    

    The logistic regression result is quite close to the real model; however, as you already know, with so few events the usual assumptions will not hold.

    Now let's oversample the original dataset by keeping all event cases and selecting non-event cases with probability 1/20:

    data oversampling;
    set dummy_data;
       if y=1 then output;                    * keep all events;
       if y=0 then do;
         if ranuni(10000)<1/20 then output;   * keep roughly 1 in 20 non-events;
       end;
    run;
    
    
    
    proc freq data=oversampling;
    tables y;
    run;
    
                                  Cumulative    Cumulative
    y    Frequency     Percent     Frequency      Percent
    ------------------------------------------------------
    0          54       72.00            54        72.00  
    1          21       28.00            75       100.00  
    

    Your event rate has jumped (magically) from 2.1% to 28%. Let’s run proc logistic again.

    proc logistic data=oversampling;
    model y(event="1")=iv;
    run;

                                   Standard          Wald
    Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
    
    Intercept     1     -2.9836      0.6982       18.2622        <.0001
    iv            1      2.0068      0.5139       15.2519        <.0001 
    

    As you can see, the iv estimate is still close to the real value, but your intercept has changed from -5.43 to -2.98, which is very different from our true value of -6. (The shift is roughly ln(20) ≈ 3, the log of the ratio of the sampling fractions, and -6 + 3 ≈ -3.)
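
    For reference, option 1 above (weighted logistic regression) would look roughly like this on the oversampled data. This is only a sketch; the dataset name oversampling_weighted and the weight variable w are my own. The idea is to weight every record by the population share of its class divided by its sample share (2.10%/28% for events, 97.90%/72% for non-events) and pass that to the WEIGHT statement of proc logistic. Bear in mind that the standard errors from a weighted fit should be interpreted with care.

    data oversampling_weighted;
    set oversampling;
       /* weight = population share of the class / sample share of the class */
       if y=1 then w=0.0210/0.28;           /* events are down-weighted   */
       else        w=(1-0.0210)/(1-0.28);   /* non-events are up-weighted */
    run;

    proc logistic data=oversampling_weighted;
    weight w;
    model y(event="1")=iv;
    run;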

    This is where the offset plays its part. The offset is the log of the ratio between the sample and population event odds; it adjusts the intercept so that it reflects the true distribution of events rather than the sample distribution (the oversampled dataset).

    Offset = log( (0.28/(1-0.28)) * ((1-0.0210)/0.0210) ) = 2.897548

    So your adjusted intercept will be intercept = -2.9836 - 2.897548 = -5.88115, which is quite close to the real value of -6.

    Or using the offset option in proc logistic:

    data oversampling_with_offset;
    set oversampling;
    off= log((0.28/(1-0.28))*((1-0.0210)/0.0210)) ;
    run;
    
    proc logistic data=oversampling_with_offset;
    model y(event="1")=iv / offset=off;
    run;
    
    
                                   Standard          Wald
    Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
    
    Intercept     1     -5.8811      0.6982       70.9582        <.0001
    iv            1      2.0068      0.5138       15.2518        <.0001
    off           1      1.0000           0         .             .    
    

    From here all your estimates are correctly adjusted and analysis & interpretation should be carried out as normal.
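
    Since you also asked about scoring: below is a minimal sketch of scoring the full population with the adjusted estimates. The dataset name scored and the variables xb and p_hat are my own, and the coefficients -5.8811 and 2.0068 are simply copied from the offset output above (in practice you would pull them from the fitted model). The key point is that the offset itself is dropped (treated as zero) at scoring time, so the predicted probabilities are on the population scale.

    data scored;
    set dummy_data;                     /* the full, unsampled data to score     */
       xb    = -5.8811 + 2.0068*iv;     /* adjusted intercept + slope from above */
       p_hat = 1/(1+exp(-xb));          /* corrected event probability           */
    run;

    proc means data=scored mean;
    var p_hat y;                        /* average score vs. actual event rate   */
    run;

    On this dummy data the average p_hat should land near the true 2.1% event rate rather than the oversampled 28%.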

    Hope this helps.