Search code examples
rdplyrpanelna

Remove group of specific rows in panel data when multiple columns contain NA


I have a balanced panel data set df with columns:

id, time, var1, var2, var3, var4....

Some values in var1 and var4 contain Na elements.

If they contain the NA element, I want to remove all the corresponding id rows that contain the NA elements so that my data remains balanced.

So if var1 contains Na that corresponds to id=28, I want to remove all the id=28 rows. Can I do that using Dplyr in R?

I did something like this:

df %>% select(c(var1,var4)) %>% group_by(id) %>% filter( !is.na(.))

But it doesn't work. Could you please provide some guidance?


Solution

  • Always provide sample data to get better and faster answers.

    Try filter with if_all and starts_with.

    The code:

    # Library(tidyverse)
    
    set.seed(128)
    aux <- tibble(
      id = 1:10, 
      time = now() + seconds(sample(c(rep(NA, 1000), -3600:3600), 10, replace = TRUE)) %>% hms::hms(),
      var1 = sample(c(NA, 1:10), 10, replace = TRUE),
      var2 = sample(c(NA, 1:10), 10, replace = TRUE),
      var3 = sample(c(NA, 1:10), 10, replace = TRUE),
      var4 = sample(c(NA, 1:10), 10, replace = TRUE))
    
    aux <- filter(aux, if_all(starts_with("var"), \(x) !is.na(x)))
    

    Input:

    > aux
    # A tibble: 10 × 6
          id time                 var1  var2  var3  var4
       <int> <dttm>              <int> <int> <int> <int>
     1     1 2024-04-11 16:30:32    NA    NA     3     1
     2     2 2024-04-11 16:11:08     3     5     1     9
     3     3 2024-04-11 14:47:42     5     2     9     4
     4     4 2024-04-11 15:43:26     7     6     6    NA
     5     5 NA                      6     2     5    10
     6     6 2024-04-11 15:18:34     5     3     4    NA
     7     7 2024-04-11 15:54:08     9    NA     7     8
     8     8 2024-04-11 16:19:12     1    10     7     9
     9     9 2024-04-11 15:46:24     2    NA     2    NA
    10    10 2024-04-11 15:32:55     1     1     9     6
    

    Output:

    # A tibble: 5 × 6
         id time                 var1  var2  var3  var4
      <int> <dttm>              <int> <int> <int> <int>
    1     2 2024-04-11 16:11:08     3     5     1     9
    2     3 2024-04-11 14:47:42     5     2     9     4
    3     5 NA                      6     2     5    10
    4     8 2024-04-11 16:19:12     1    10     7     9
    5    10 2024-04-11 15:32:55     1     1     9     6