Tags: r, math, logistic-regression

Logistic regression: 'odds ratio' is essentially just the ratio - what's the point?


Trying to understand the use of logistic regression. I have the following data:

Gender  Age    No.transcation  Transaction
female  18-24  138485          4047
male    18-24  144301          3766
female  25-34  248362          7559
male    25-34  295800          8126
female  35-44  265514          7171
male    35-44  379872          9047
female  45-54  295002          7072
male    45-54  421432          9648
female  55-64  382198          7529
male    55-64  456308          9016
female  65+    352501          4856
male    65+    465253          6889

Running logistic regression in R, I get the following summary output:

    > mod2 <- glm(cbind(Transaction, No.transcation) ~ Gender + Age, data = csvd,
    +             family = binomial())
    > summary(mod2)

    Call:
    glm(formula = cbind(Transaction, No.transcation) ~ Gender + Age, 
        family = binomial(), data = csvd)

    Deviance Residuals: 
          1        2        3        4        5        6  
     1.8732  -1.9018   2.2654  -2.1473   3.4810  -3.0228  
          7        8        9       10       11       12  
    -0.2772   0.2377  -2.5500   2.3717  -4.9638   4.3408  

    Coefficients:
                 Estimate Std. Error  z value Pr(>|z|)    
    (Intercept) -3.562800   0.011984 -297.290  < 2e-16 ***
    Gendermale  -0.051852   0.006993   -7.415 1.22e-13 ***
    Age25-34     0.044091   0.014042    3.140  0.00169 ** 
    Age35-44    -0.090757   0.013966   -6.499 8.11e-11 ***
    Age45-54    -0.164705   0.013894  -11.855  < 2e-16 ***
    Age55-64    -0.334841   0.013900  -24.088  < 2e-16 ***
    Age65+      -0.651142   0.014767  -44.094  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 4490.792  on 11  degrees of freedom
    Residual deviance:   93.866  on  5  degrees of freedom
    AIC: 235.5

    Number of Fisher Scoring iterations: 3

Exponentiating the coefficients to get the odds ratios, I find they are almost identical to the simple ratios of users with transactions:

    > exp(summary(mod2)$coefficients)
                  Estimate Std. Error       z value Pr(>|z|)
    (Intercept) 0.02835931   1.012056 7.735499e-130 1.000000
    Gendermale  0.94946976   1.007018  6.022806e-04 1.000000
    Age25-34    1.04507762   1.014141  2.310243e+01 1.001691
    Age35-44    0.91323954   1.014064  1.505641e-03 1.000000
    Age45-54    0.84814413   1.013991  7.106341e-06 1.000000
    Age55-64    0.71545181   1.013998  3.455562e-11 1.000000
    Age65+      0.52145005   1.014877  7.084264e-20 1.000000
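(Note that exponentiating the whole summary matrix also transforms the standard errors and p-values, which isn't meaningful; only the Estimate column survives as odds ratios. Exponentiating just the estimates, with confidence intervals on the same scale, is cleaner. A small sketch; `confint()` on a glm profiles the likelihood via the MASS package that ships with R:)

    exp(coef(mod2))      # odds ratios for each coefficient
    exp(confint(mod2))   # 95% confidence intervals on the odds-ratio scale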

Comparing the odds ratios to the simple rate ratios, i.e. users with transactions divided by total users per group, relative to the female and 18-24 baseline groups, I get pretty much the same numbers:

Gender  Rate vs. female
female  (baseline)
male    94.68%

Age    Rate vs. 18-24
18-24  (baseline)
25-34  104.21%
35-44  91.17%
45-54  84.82%
55-64  71.97%
65+    52.66%
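(These raw ratios can be reproduced directly from the table; a quick sketch, assuming the `csvd` data frame above:)

    # per-group transaction rates, aggregated over the other variable
    tot <- csvd$Transaction + csvd$No.transcation
    g <- tapply(csvd$Transaction, csvd$Gender, sum) / tapply(tot, csvd$Gender, sum)
    unname(g["male"] / g["female"])   # ~0.9468, close to the 0.9495 odds ratio
    a <- tapply(csvd$Transaction, csvd$Age, sum) / tapply(tot, csvd$Age, sum)
    a / a["18-24"]                    # ~1.042, 0.912, 0.848, 0.720, 0.527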

So what's even the point of running logistic regression here? This dataset has only 2 features, but it could just as well be extended to 50. What does logistic regression offer over simply looking at the ratio for each group in this case? Is it because all the variables are nominal that it doesn't add much?


Solution

  • You would hope that the estimated odds ratios are close to the realised proportions like this. You are estimating the probability Pr(Y = 1 | X = x): the probability of a transaction given age range and gender. With categorical predictors like these, an intuitive estimator is simply the proportion of outcomes in each group. (The odds ratios and the raw rate ratios nearly coincide here because the transaction probabilities are small, around 2-3%, and for small p the odds p / (1 - p) are approximately p itself.) Logistic regression becomes more interesting when the predictors are continuous variables and you'd like to predict the probability of an outcome at a predictor value you haven't observed. In these cases the model log(p / (1 - p)) = b0 + b1*x maps an unbounded linear function of your predictors onto a probability, which by definition must lie between 0 and 1.
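To illustrate the continuous case, here is a hypothetical sketch (not part of the original answer) that swaps the age factor for a numeric midpoint, letting the model predict at ages that never appear in the table. The midpoints, including 70 for the open-ended "65+" band, and the `AgeMid` name are assumptions for illustration:

    # hypothetical numeric recoding of the age bands (midpoints; 70 assumed for "65+")
    csvd$AgeMid <- rep(c(21, 29.5, 39.5, 49.5, 59.5, 70), each = 2)
    mod3 <- glm(cbind(Transaction, No.transcation) ~ Gender + AgeMid,
                data = csvd, family = binomial())
    # predicted probability of a transaction for a 33-year-old woman;
    # a table of group proportions has no entry for age 33
    predict(mod3, newdata = data.frame(Gender = "female", AgeMid = 33),
            type = "response")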