Search code examples
rdplyrglmquantile

Why is the quantile function not working for this dplyr function?


I'm working through Faraway's 2016 book Extending the Linear Model with R and have encountered an issue with the code that I don't know how to fix. Here is the relevant syntax leading up to the error:

#### Load Data & Libraries ####
library(faraway)
library(tidyverse)
data(wcgs)

#### Add Variables ####
wcgs$y <- ifelse(wcgs$chd == "no",0,1) # create binary response from chd 
wcgs$bmi <- with(wcgs,
                 703*wcgs$weight/(wcgs$height^2)) # create BMI variable

#### Create GLM Model ####
lmod <- glm(chd ~ height + cigs,
            family = binomial, 
            wcgs)

#### Mutate Data ####
wcgs <- mutate(wcgs,
               residuals=residuals(lmod), 
               linpred=predict(lmod)) # create residuals/pred values

And this is the part where the error arises (the third line which includes a mutate function:

#### Error Code (Last Line) ####
wcgsm <- na.omit(wcgs) # omit NA values
wcgsm <- mutate(wcgsm,
                predprob=predict(lmod,
                                 type="response")) # make pred data
gdf <- group_by(wcgsm, 
                cut(linpred,
                    breaks=unique(quantile(linpred, 
                                           (1:100)/101)))) # bin NA

Which gives me this error:

Error in `group_by()`:
! Problem adding computed columns.
Caused by error in `mutate()`:
! Problem while computing `..1 = cut(linpred, breaks = unique(quantile(linpred,
  (1:100)/101)))`.
✖ `..1` must be size 3140 or 1, not 3154.

I dont understand what this error means. When I run dim(wcgs), I get there are 3154 rows, and when I run dim(na.omit(wcgs)) I get 3140 rows. The only thing I can think of is that the predicted model values dont line up with the new na.omit data, but I'm not sure now how to work around that given the rest of this chapter uses this data manipulation.


Solution

  • predict methods for R's modeling functions always predict from the original data set the models were fitted to. To have a new data set, in this case a subset of the data wcgs, argument newdata must be explicitly set.
    The error in the predict line at the bottom is therefore expected behavior.

    #### Load Data & Libraries ####
    suppressPackageStartupMessages({
      #library(faraway)
      library(dplyr)
    })
    data(wcgs, package = "faraway")
    
    #### Add Variables ####
    wcgs$y <- as.integer(wcgs$chd == "yes") # create binary response from chd 
    wcgs$bmi <- with(wcgs, 703*weight/(height^2)) # create BMI variable
    
    #### Create GLM Model ####
    lmod <- glm(chd ~ height + cigs, family = binomial, data = wcgs)
    
    #### Mutate Data ####
    # create residuals/pred values
    wcgs <- mutate(wcgs,
                   residuals = residuals(lmod), 
                   linpred = predict(lmod)) 
    
    wcgsm <- na.omit(wcgs) # omit NA values
    wcgsm <- mutate(wcgsm,
                    predprob = predict(lmod, type="response")) # make pred data
    #> Error in `mutate()`:
    #> ! Problem while computing `predprob = predict(lmod, type = "response")`.
    #> ✖ `predprob` must be size 3140 or 1, not 3154.
    

    Created on 2022-07-16 by the reprex package (v2.0.1)

    See where the error comes from.

    predprob_all <- predict(lmod, type = "response")
    predprob_na.omit <- predict(lmod, newdata = wcgsm, type = "response")
    
    length(predprob_all)
    #> [1] 3154
    length(predprob_na.omit)
    #> [1] 3140
    

    Created on 2022-07-16 by the reprex package (v2.0.1)

    These lengths are the values in the error message, once again, as expected.

    There is also the problem of the quantiles in cut(., breaks) not spanning the entire range of linpred. Values outside the quantiles' range will become NA. This is solved with the two endpoints of the breaks vector.
    And I have given a name to the binned vector.
    The following code works and, I believe, does what is needed.

    wcgsm <- na.omit(wcgs) # omit NA values
    wcgsm <- mutate(wcgsm,
                    predprob = predict(lmod, newdata = wcgsm, type="response")) # make pred data
    
    breaks <- c(-Inf, 
                unique(quantile(wcgsm$linpred, (1:100)/101)),
                 Inf)
    gdf <- group_by(wcgsm, 
                    bins = cut(linpred, breaks = breaks)) # bin NA
    
    anyNA(gdf$bins)
    #> [1] FALSE
    

    Created on 2022-07-16 by the reprex package (v2.0.1)