Search code examples
rdata-manipulation

How to drop a column when it has more missing values than another column in R


I need to drop a column when it has more missingness than the other column. Here is a sample dataset.

df <- data.frame(id = c(1,2,3,4,5),
                 col.x = c(44,55,66,NA,NA),
                 col.y = c(44,NA,66,NA,NA))

> df
  id col.x col.y
1  1    44    44
2  2    55    NA
3  3    66    66
4  4    NA    NA
5  5    NA    NA

In this case, col.y has more missing values so I need to drop col.y. How can I get the desired dataset below?

   > df
      id col.x 
    1  1    44    
    2  2    55    
    3  3    66    
    4  4    NA    
    5  5    NA  

Solution

  • If the columns only differ with respect to their missing values, we can use coalesce():

    library(tidyverse)
    
    df <- tibble(
      id = c(1, 2, 3, 4, 5),
      col.x = c(44, 55, 66, NA, NA),
      col.y = c(44, NA, 66, NA, NA)
    )
    
    df |> 
      mutate(col = coalesce(col.x, col.y), .keep = "unused")
    #> # A tibble: 5 × 2
    #>      id   col
    #>   <dbl> <dbl>
    #> 1     1    44
    #> 2     2    55
    #> 3     3    66
    #> 4     4    NA
    #> 5     5    NA
    

    Created on 2023-03-22 with reprex v2.0.2