Search code examples
rdplyrconditional-statementscoalescing

Value of first and last non missing time point by condition of other variables. conditional coalesce (dplyr)


a/b/c is different variables, t1 is time point 1, t2 is time point 2, t3 is time point 3.

The purpose is to create a two new columns: one with the first and one with the last non missing value for each row of a_t1 to a_t3. On the condition that it is also non missing in the variables b and c (at the same time point).

I think coalesce() could work with some kind of conditional formatting. However, my knowledge of this is limited. Preferably tidyverse solution, but other solutions works as well.

library(tidyverse)

df<-tibble::tribble(
                       ~a_t1, ~a_t2, ~a_t3, ~b_t1, ~b_t2, ~b_t3, ~c_t1, ~c_t2, ~c_t3,
                           1,    NA,     9,     2,    NA,     6,     3,    NA,     7,
                           2,    NA,     8,    NA,     5,     8,    NA,     1,     8,
                          NA,    NA,     3,     2,    NA,     9,     2,    NA,    22,
                          NA,     5,     9,     4,    NA,     9,     4,     5,    NA,
                          NA,     9,    10,    NA,     6,    11,    NA,     6,    NA
                       )

Expected output for the first value of a: 1, 8, 3, NA, 9

Expected output for the last value of a: 9, 8, 3, NA, 9

(in the real dataset there is more timepoints and variables to consider)


Solution

  • A way using dplyr and tidyr :

    library(dplyr)
    library(tidyr)
    
    df %>%
      #Create a row number
      mutate(row = row_number()) %>%
      #Get data in long format
      pivot_longer(cols = -row) %>%
      #Separate the data in two columns
      separate(name, c('name1', 'name2'), sep = '_') %>%
      #Group by each row and t1, t2 columns
      group_by(row, name2) %>%
      #Drop groups with all `NA` values
      filter(all(!is.na(value))) %>%
      #For each row get first and last value for "a" columns
      group_by(row) %>%
      summarise(first = first(value[name1 == 'a']), 
                last = last(value[name1 == 'a'])) %>%
      #Complete the data for missing rows.
      complete(row = 1:nrow(df))
    
    #    row first last
    #  <int> <dbl> <dbl>
    #1     1     1     9
    #2     2     8     8
    #3     3     3     3
    #4     4    NA    NA
    #5     5     9     9