r, logistic-regression, glm

What does NA in odds ratio mean?


I am currently working on landing page testing in which both the independent and the dependent variables are logical. I want to find out which of these variables, when TRUE, is a major factor for a conversion.

We are testing multiple variations of a single element. For example, we have three different images; if image 1 is TRUE for a row, the other two image variables are FALSE.

I used logistic regression to conduct this test. When I looked at the odds ratio output, many of the entries were NA. I am not sure how to interpret them or how to fix them.

Below is the sample dataset. The actual data has 18000+ rows.

[image: sample dataset]

classifier1 <- glm(formula = Target ~ .,
                   family = binomial,
                   data = Dataset)

This is the output.

[image: model output]

Does this mean I need more data? Is there some other way to conduct multivariate landing page testing?


Solution

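  • It looks like two or more of your variables (columns) are perfectly correlated, so glm cannot estimate a separate coefficient for each of them and reports NA for the redundant ones. In your setup this is expected: if exactly one of the three image variables is TRUE in every row, the three columns always sum to 1, which duplicates the intercept. More data will not help; instead, drop one column from each such group so that it becomes the baseline.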

    You can see this with a toy data.frame of random values:

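    n <- 20
    # Five random logical predictors, a to e (no seed is set, so results vary between runs)
    y <- matrix(sample(c(TRUE, FALSE), 5 * n, replace = TRUE), ncol = 5)
    colnames(y) <- letters[1:5]
    z <- as.data.frame(y)
    # Alternating 0/1 response, one value per row
    z$target <- rep(0:1, length.out = nrow(z))
    m <- glm(target ~ ., data = z, family = binomial)
    summary(m)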
    

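    In the summary, every coefficient has an estimate; there are no NAs. (The huge standard errors for b and d most likely come from the tiny random sample, not from collinearity.)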

    Call:
    glm(formula = target ~ ., family = binomial, data = z)
    
    Deviance Residuals: 
         Min        1Q    Median        3Q       Max  
    -1.89808  -0.48166  -0.00004   0.64134   1.89222  
    
    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)  
    (Intercept)  -22.3679  4700.1462  -0.005   0.9962  
    aTRUE          3.2286     1.6601   1.945   0.0518 .
    bTRUE         20.2584  4700.1459   0.004   0.9966  
    cTRUE          0.7928     1.3743   0.577   0.5640  
    dTRUE         17.0438  4700.1460   0.004   0.9971  
    eTRUE          2.9238     1.6658   1.755   0.0792 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 27.726  on 19  degrees of freedom
    Residual deviance: 14.867  on 14  degrees of freedom
    AIC: 26.867
    
    Number of Fisher Scoring iterations: 18
    

    But if you make two columns perfectly correlated, as below, and refit the model:

    z$a <- z$b  # a is now an exact copy of b
    m <- glm(target ~ ., data = z, family = binomial)
    summary(m)
    

    the coefficient of the redundant column is reported as NA:

    Call:
    glm(formula = target ~ ., family = binomial, data = z)
    
    Deviance Residuals: 
         Min        1Q    Median        3Q       Max  
    -1.66621  -1.01173   0.00001   1.06907   1.39309  
    
    Coefficients: (1 not defined because of singularities)
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -18.8718  3243.8340  -0.006    0.995
    aTRUE         18.7777  3243.8339   0.006    0.995
    bTRUE              NA         NA      NA       NA
    cTRUE          0.3544     1.0775   0.329    0.742
    dTRUE         17.1826  3243.8340   0.005    0.996
    eTRUE          1.1952     1.2788   0.935    0.350
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 27.726  on 19  degrees of freedom
    Residual deviance: 19.996  on 15  degrees of freedom
    AIC: 29.996
    
    Number of Fisher Scoring iterations: 17
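
To tie this back to the question, here is a minimal sketch of how the NA terms could be located and removed, and how odds ratios could then be computed. Everything in it is made up for illustration: the seed, the sample size, the `b_copy` column, and the `img1`/`img2`/`img3` indicators, which stand in for the three image variables described in the question. It is not the asker's real data or necessarily the best design for a landing page test.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
n <- 200     # hypothetical sample size

# Case 1: an exact duplicate column, as in the toy example above
z <- as.data.frame(matrix(sample(c(TRUE, FALSE), 3 * n, replace = TRUE), ncol = 3))
names(z) <- c("a", "b", "c")
z$target <- rbinom(n, 1, 0.5)
z$b_copy <- z$b  # hypothetical redundant column

m <- glm(target ~ ., data = z, family = binomial)

# Coefficients that glm() could not estimate are returned as NA
aliased <- names(coef(m))[is.na(coef(m))]
print(aliased)

# Refit without the redundant column
z_fixed <- z[, setdiff(names(z), "b_copy")]
m_fixed <- glm(target ~ ., data = z_fixed, family = binomial)

# Case 2: mutually exclusive indicators (exactly one is TRUE per row)
img <- sample(c("img1", "img2", "img3"), n, replace = TRUE)
d <- data.frame(
  target = rbinom(n, 1, 0.5),
  img1 = img == "img1",
  img2 = img == "img2",
  img3 = img == "img3"
)

# img1 + img2 + img3 is always 1, the same as the intercept column,
# so one of the three indicators cannot be estimated
m_onehot <- glm(target ~ ., data = d, family = binomial)
print(coef(m_onehot))

# Encoding the variations as a single factor avoids the redundancy:
# the first level (img1) becomes the baseline
d$image <- factor(img)
m_factor <- glm(target ~ image, data = d, family = binomial)

# Exponentiated coefficients: the intercept gives the baseline odds,
# the remaining entries are odds ratios relative to img1
odds_ratios <- exp(coef(m_factor))
print(odds_ratios)
```

With the factor encoding, each remaining coefficient compares one image against the baseline image, which is usually the comparison a test of variations is after. Changing the baseline is a matter of reordering the factor levels, for example with `relevel()`.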