Search code examples
riterationlinear-regressionpurrrmissing-data

Using an if-statement inside purrr::map to avoid missing data errors


I'm using purrr to run a series of single linear regressions across multiple columns of a grouped dataset, but am having trouble excluding groups of variables that have no data without deleting the entire group.

Thanks to andrew_reece here, I got the base code working as:

library(tidyverse)

ivs <- colnames(mtcars)[3:ncol(mtcars)]
names(ivs) <- ivs

mtcars %>% 
  group_by(cyl) %>% 
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      frml <- as.formula(paste("mpg", "~", iv))
      lm(frml, data = data) %>% broom::glance()
      }, .id = "iv") 
  }) %>% 
  select(cyl, iv, r.squared, p.value)

which gives a tibble in this format:

cyl iv      r.squared       p.value
4   disp    0.6484051396    0.002782827 
4   hp      0.2740558319    0.098398581 
4   drat    0.180           0.193  
4   wt      0.509           0.0137 
4   qsec    0.0557          0.485  
4   vs      0.00238         0.887  
...
6   disp    0.0106260401    0.825929685 
...

Unfortunately, my real dataset is messy and contains multiple group-variable combinations with only NAs, or with less than two real values, which lm can't handle. To show this, here is mtcars with some data in 'disp' replaced with NA. Run through the above code, mtna throws a NA-error.

#create mtcars dataset that will have a cyl group with entirely NA disp
mtna <- mtcars 
mtna$disp[mtna$disp < 147] <- NA

test <- mtna %>% group_by(cyl) %>% summarize(mean = mean(disp))

I tried to deal with this by making the lm conditional, and first using sum(!is.na) to check if there are enough real values to run lm. This allows the lm to run successfully.

mtna %>% 
  group_by(cyl) %>% 
  group_modify(function(data, key) {
    map_df(ivs, function(iv) {
      tmpvar <- eval(parse(text = paste0("data$", iv)))
      if(sum(!is.na(tmpvar)) < 3) {return(NA)} else {
      frml <- as.formula(paste("mpg", "~", iv))
      lm(frml, data = data) %>% broom::glance()
      }}, .id = "iv") 
  }) %>% 
  select(cyl, iv, r.squared, p.value)

#which gives:
       cyl    iv        r.squared  p.value
 1     4      NA        NA         NA     
 2     6      disp      0.0115     0.840 
 3     6      hp        0.0161     0.786 
 4     6      drat      0.0132     0.807 
...

However, when you look at the results, you can see that the NA has extended to the whole group, including variables other than disp (which is the only one that had missing values). There is now no data related to cyl = 4 at all, even in groups like hp and drat, which had no missing data.

What I was hoping for was something like:

cyl    iv        r.squared      p.value
4      disp      NA             NA 
4      hp        0.2740558319   0.098398581   # Currently missing
4      drat      0.1799791311   0.193450651   # Currently missing     
4      wt        0.5086325963   0.013742782.  # Currently missing
... 
6      disp      0.0106260401   0.825929685 
6      hp        0.0161462379   0.78602021  
...

I suspect this has something to do with the data format - I guess I'm mapping NA across all the results for that group, instead of just that one variable. But I have no idea how to address this. Any help is greatly appreciated!


Solution

  • I think you are almost there. Take a look at the code below. Changes are commented.

    library(tidyverse)
    
    ivs <- colnames(mtcars)[3:ncol(mtcars)]
    names(ivs) <- ivs
    
    mtna <- mtcars 
    mtna$disp[mtna$disp < 147] <- NA
    
    mtna  %>% 
      group_by(cyl) %>% 
      group_modify(function(data, key) {
        map_df(ivs, function(iv) {
          tmpvar<-eval(parse(text = paste0("data$", iv)))
          if(!is.na(sum(tmpvar))) {                          #only use complete data
            frml <- as.formula(paste("mpg", "~", iv))
            lm(frml, data = data) %>% broom::glance()
          }}, .id = "iv") 
      }) %>% 
      select(cyl, iv, r.squared, p.value) %>% 
      right_join(.,expand.grid(cyl=unique(mtna$cyl),iv=ivs),
                 by=c("cyl","iv")) %>%                       #populating with NA for columns lost before
      arrange(., cyl, iv) %>%                                #sort by cyl and iv
      as.data.frame()
      
    
      #which gives:
           cyl   iv    r.squared    p.value
        1    4   am 0.2872892493 0.08921640
        2    4 carb 0.0378466325 0.56650426
        3    4 disp           NA         NA
        4    4 drat 0.1799791311 0.19345065
        5    4 gear 0.1146552225 0.30840242
        6    4   hp 0.2740558319 0.09839858
        7    4 qsec 0.0556742395 0.48487154
        8    4   vs 0.0023819528 0.88668635
        9    4   wt 0.5086325963 0.01374278
        10   6   am 0.2810551424 0.22094408
        11   6 carb 0.0000661434 0.98619369
        12   6 disp           NA         NA
        13   6 drat 0.0131597553 0.80653099
        14   6 gear 0.0000901510 0.98388187
        15   6   hp 0.0161462379 0.78602021
        16   6 qsec 0.1753245893 0.34980138
        17   6   vs 0.2810551424 0.22094408
        18   6   wt 0.4645101506 0.09175766
        19   8   am 0.0024647887 0.86615546
        20   8 carb 0.1550637156 0.16359436
        21   8 disp 0.2701577717 0.05677488
        22   8 drat 0.0022975223 0.87074078
        23   8 gear 0.0024647887 0.86615546
        24   8   hp 0.0804491933 0.32575378
        25   8 qsec 0.0108860059 0.72261712
        26   8   vs 0.0000000000         NA
        27   8   wt 0.4229655365 0.01179281