Search code examples
rvariablesna

How do I create a variable that tells me which of a number of other variables is the first one to not have a missing value for one observation?


If I have the following data structure in my data frame df:

a  b  c  d

1  2  3  4
NA NA 1  2
NA 1  2  NA
NA NA NA 1

how can I create a variable that tells me, which of the variables is the first one to not have a missing value, such that:

a  b  c  d  var

1  2  3  4  a
NA NA 1  2  c
NA 1  2  NA b
NA NA NA 1  d

I need the code to work with variable names and not column numbers, because of the size of the dataset and changing the order of the variables.

I have tried:

df <- df %>% mutate(var = coalesce(deparse(substitute(a)), deparse(substitute(b)), deparse(substitute(c)), deparse(substitute(d))))

and

df <- df %>% mutate(var = deparse(substitute(do.call(coalesce, across(c(a, b, c, d))))))

trying to implement this approach. I got the code to extract the string of a variable name from: How to convert variable (object) name into String


Solution

  • We can use max.col, i.e.

    names(df)[max.col(!is.na(df), ties.method = 'first')]
    #[1] "a" "c" "b" "d"
    

    Via dplyr,

    library(dplyr)
    
    df %>% 
     mutate(var = names(.)[max.col(!is.na(.), ties.method = 'first')])
    
       a  b  c  d var
    1  1  2  3  4   a
    2 NA NA  1  2   c
    3 NA  1  2 NA   b
    4 NA NA NA  1   d
    

    You can specify variables

    df %>% 
     mutate(var = names(.[c('a', 'b')])[max.col(!is.na(.[c('a', 'b')]), ties.method = 'first')])
       a  b  c  d var
    1  1  2  3  4   a
    2 NA NA  1  2   a
    3 NA  1  2 NA   b
    4 NA NA NA  1   a
    

    You can also change the order of the variables via select(), i.e.

    df %>% 
     select(c, d, b, a) %>%
     mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')])
    
       c  d  b  a new
    1  3  4  2  1   c
    2  1  2 NA NA   c
    3  2 NA  1 NA   c
    4 NA  1 NA NA   d
    

    You can also call select() again to restore the original order of columns but maintain the result from previous order, i.e.

    df %>% 
     select(c, d, b, a) %>%
     mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')]) %>% 
     select(names(df), new)
    
       a  b  c  d new
    1  1  2  3  4   c
    2 NA NA  1  2   c
    3 NA  1  2 NA   c
    4 NA NA NA  1   d
    

    To maintain all the variables at the end, you can join on the original data frame, i.e.

    df %>% 
     select(c, d, b) %>%
     mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')]) %>% 
     left_join(df) %>% 
     select(names(df), new)
    
    Joining, by = c("c", "d", "b")
       a  b  c  d new
    1  1  2  3  4   c
    2 NA NA  1  2   c
    3 NA  1  2 NA   c
    4 NA NA NA  1   d