Search code examples
rconcatenationacross

Concatenate first character substring from many columns into a new column without explicitly listing every column name


Problem Statement

I have a dataset which has 100 diagnosis fields, stored as character. They are named DiagnosisCode1 through to DiagnosisCode100. In most cases, the first diagnosis field is complete, but not always. Higher diagnosis fields are almost universally empty as NA. I need to strip the first character off each of these fields and concatenate them in a single column which will be named OnsetFlags. Because of the large number of columns I will be referring to, I would prefer not to explicitly reference all 100 fields, but refer to it either using a dplyr helper - such as starts_with() - or by using a range, like DiagnosisCode1:DiagnosisCode100.

Example Data

For the purposes of a concise demonstration, the below dataset uses 5 diagnosis columns:

example = structure(list(id = c("1", "2", "3", "4", 
"5", "6", "7"), DiagnosisCode1 = c("2G56.2", 
"2M48.06", "2G56.2", "2G56.0", "2S83.53", "2M23.20", "2S83.53"
), DiagnosisCode2 = c("2G56.0", "2G55.3", "2G56.0", "2G56.2", 
"2Y92.82", "2Z86.43", "2S83.2"), DiagnosisCode3 = c("2Z86.43", 
"2Z86.43", "2Z86.43", "1J98.1", "2V93.8", NA, "2W19"), DiagnosisCode4 = c(NA, 
"2U82.3", NA, "2U80.2", "2U73.8", NA, "2Y92.39"), DiagnosisCode5 = c(NA, 
"2U83.3", NA, NA, NA, NA, "2U56.39")), row.names = c(NA, -7L), class = "data.frame")

Which produces:

  id DiagnosisCode1 DiagnosisCode2 DiagnosisCode3 DiagnosisCode4 DiagnosisCode5
1  1         2G56.2         2G56.0        2Z86.43           <NA>           <NA>
2  2        2M48.06         2G55.3        2Z86.43         2U82.3         2U83.3
3  3         2G56.2         2G56.0        2Z86.43           <NA>           <NA>
4  4         2G56.0         2G56.2         1J98.1         2U80.2           <NA>
5  5        2S83.53        2Y92.82         2V93.8         2U73.8           <NA>
6  6        2M23.20        2Z86.43           <NA>           <NA>           <NA>
7  7        2S83.53         2S83.2           2W19        2Y92.39        2U56.39

Required Outcome

  id DiagnosisCode1 DiagnosisCode2 DiagnosisCode3 DiagnosisCode4 DiagnosisCode5 OnsetFlags
1  1          G56.2          G56.0         Z86.43           <NA>           <NA>        222
2  2         M48.06          G55.3         Z86.43          U82.3          U83.3      22222
3  3          G56.2          G56.0         Z86.43           <NA>           <NA>        222
4  4          G56.0          G56.2          J98.1          U80.2           <NA>       2212
5  5         S83.53         Y92.82          V93.8          U73.8           <NA>        222
6  6         M23.20         Z86.43           <NA>           <NA>           <NA>         22
7  7         S83.53          S83.2            W19         Y92.39         U56.39      22222

Attempts

So far, I have tried my usual bag of tricks with dplyr across():

  # create onset flag array
  example = example %>% 
    mutate(OnsetFlags = across(starts_with("DiagnosisCode"), ~ as.character(substr(., 1, 1))
      ))

However, this creates a tibble column which itself contains multiple OnsetFlags columns equal to the number of DiagnosisCode columns. I'm having difficulty pulling these out of the tibble and concatenating into a dataframe column and, honestly, it feels like it's adding a lot of overhead for what I thought should be a simple process.

The closest other examples I've found on SO were:

How can I concatenate two columns' data whilst only taking the first character of each entry to create an acronym in the new column?

...but I don't seem to be able to unite the tibble columns.

Is there an easier way of doing this because I don't feel like I'm on the right path?


Solution

  • If this absolutely must be done in the wide format, I'd avoid apply (which loops over each and every row and will bog down on large datasets), and nest to deal with the substr/unite processing in the standard vectorised R fashion. Something like this should work, though it could probably be improved.

    example %>% 
        mutate(onset = example %>% 
                           select(starts_with("DiagnosisCode")) %>%
                           map_df(substr,1,1) %>%
                           unite("onset", na.rm=TRUE, sep="")
                           %>% pull(onset)
        )
    

    The 'tidy' approach however would be to not work with wide 'untidy' data. Instead, pivot_longer this into a long, 'tidy' diagnosis dataset.
    Any summary information from your long diagnosis dataset can be joined on at the appropriate measurement level. This allows processing/analysis of grouped diagnoses and onset flags to be done in a very simple manner that doesn't get bogged down in variable prefix selections, apply/across logic, or needing to handle NA values because of the unbalanced number of diagnoses per event (note values_drop_NA=TRUE when pivoting).

    diagnosis <- example %>%
        pivot_longer(-id,
                     names_pattern=".+(\\d+)",
                     names_to="seq",
                     values_to="code",
                     values_drop_na=TRUE) %>%
        mutate(onset = as.integer(substr(code, 1, 1)),
               code  = substr(code, 2, nchar(code)))
    
    diagnosis
    ## A tibble: 26 × 4
    #   id    seq   code   onset
    #   <chr> <chr> <chr>  <int>
    # 1 1     1     G56.2      2
    # 2 1     2     G56.0      2
    # 3 1     3     Z86.43     2
    # 4 2     1     M48.06     2
    # 5 2     2     G55.3      2
    # 6 2     3     Z86.43     2
    # 7 2     4     U82.3      2
    # 8 2     5     U83.3      2
    # 9 3     1     G56.2      2
    #10 3     2     G56.0      2
    ## … with 16 more rows
    

    Your current analysis task becomes a lot simpler:

    diagnosis %>%
        group_by(id) %>%
        summarise(onset = paste(onset,collapse=""))
    
    ## A tibble: 7 × 2
    #  id    onset
    #  <chr> <chr>
    #1 1     222
    #2 2     22222
    #3 3     222
    #4 4     2212
    #5 5     2222
    #6 6     22
    #7 7     22222
    

    Though I'd argue that combining n onset values violates most good practices for organising data where each cell should contain only one value, making future analysis more difficult as you'd have to then strsplit the values to make any sense of them. Better to leave it in the long format and analyse them as they are.

    Here's another example of how to deal with other more complicated analyses in the long form:

    diagnosis %>%
        group_by(id) %>%
        summarise(onset1 = any(onset == 1),
                  G5and2 = any(substr(code,1,2) == 'G5' & onset == 2))
    
    ## A tibble: 7 × 3
    #  id    onset1 G5and2
    #  <chr> <lgl>  <lgl>
    #1 1     FALSE  TRUE
    #2 2     FALSE  TRUE
    #3 3     FALSE  TRUE
    #4 4     TRUE   TRUE
    #5 5     FALSE  FALSE
    #6 6     FALSE  FALSE
    #7 7     FALSE  FALSE