Search code examples
rdplyrdata-cleaningcoalesce

Collapsing scattered information across multiple variables into 1 in R


I have some table data that has been scattered across around 1000 variables in a dataset. Most are split across 2 variables, and I can piece together the data using coalesce, however this is pretty inefficient for some variables which are instead spread across >10. Is there are a better/more efficient way?

The syntax I have written so far is:

scattered_data <- df %>%
select(id, contains("MASS9A_E2")) %>% 
#this brings in all the variables for this one question that start with this string
mutate(speciality = coalesce(MASS9A_E2_C4_1,MASS9A_E2_C4_2,MASS9A_E2_C4_3, MASS9A_E2_C4_4, MASS9A_E2_C4_5, MASS9A_E2_C4_6, MASS9A_E2_C4_7, MASS9A_E2_C4_8, MASS9A_E2_C4_9, MASS9A_E2_C5_1,MASS9A_E2_C5_2,MASS9A_E2_C5_3, MASS9A_E2_C5_4, MASS9A_E2_C5_5, MASS9A_E2_C5_6, MASS9A_E2_C5_7, MASS9A_E2_C5_8, MASS9A_E2_C5_9))

As I have this for 28 MASS questions and would really love to be able to collapse these down a bit quicker.


Solution

  • You can use do.call() to take all columns except id as input of coalesce().

    library(dplyr)
    
    df %>%
      select(id, contains("MASS9A_E2")) %>%
      mutate(speciality = do.call(coalesce, select(df, -id)))
    

    In addition, you can call coalesce() iteratively by Reduce().

    df %>%
      select(id, contains("MASS9A_E2")) %>%
      mutate(speciality = Reduce(coalesce, select(df, -id)))