Search code examples
rdplyrgroupingmissing-datasummarize

How to replace na in a column with the first non-missing value without dropping cases that only have missing values using R?


I have a long data frame that has many NAs, but I want to condenses it so all NAs are filled with the first non-missing value when grouped by a variable--but if the observation only has NAs, it keeps it. Until I updated R, I had a code that worked (shown below), but now it deletes rows if one of their columns is all NAs.

Here's a sample dataset:

library(dplyr)

test <- tibble(name = c("J", "C", "J", "C"),
               test_1 = c(1:2, NA, NA),
               test_2 = c(NA, NA, 3:4),
               make_up_test = c(NA, 1, NA, NA))

And here's what used to work--but now deletes observations that only have NAs in one column (see J getting dropped because he only has NAs for test observation)


test %>%
  group_by(name) %>%
  summarise_all(~first(na.omit(.)))

This is what I'm hoping to get:

solution <- tibble(name = c("J", "C"),
                test_1 = c(1:2),
                test_2 = c(3:4),
                make_up_test = c(NA, 1))


Solution

  • We remove the NA with na.omit and get the first element - use [1] to coerce to NA if there are no non-NA elements present

    library(dplyr)
    test %>% 
      group_by(name) %>% 
      summarise(across(everything(), ~ first(na.omit(.x))[1]))
    

    -output

    # A tibble: 2 × 4
      name  test_1 test_2 make_up_test
      <chr>  <int>  <int>        <dbl>
    1 C          2      4            1
    2 J          1      3           NA