Search code examples
rloopstidyversedata-manipulationdata-cleaning

How to concatenate the string of a header (or cell), to the value of another cell?


I am cleaning a big dataset, that contains dates (t0:t5 and no_BL:no_48) and test scores (MMSE_BL: MMSE_FU48).

LONG STORY SHORT, I need this date: "27-09-12", to look like this "27-09-12_FU24_28",

  • where the _FU24 come from a header (columns no_BL : no_FU48).
  • and _28 comes from a score value (columns MMSE_BL_MMSE_FU48).

Longer explanation: In this example, I am showing my first 2 rows and 18 columns. Probably the best way to tell you my goal is to show how I would like the outcome to look like:

### What I have
cnames <- c("t0", "t1", "t2", "t3", "t4", "t5", "no_BL", "no_FU6", "no_Fu12", "no_FU24", "no_FU36", "no_FU48", "MMSE_BL","MMSE_FU6" ,"MMSE_FU12", "MMSE_FU24", "MMSE_36", "MMSE_FU48")
one <- c("27-09-12", "NA",   "25-07-13", "NA", "NA", "NA", "NA", "NA", "NA", "27-09-12", "25-07-13", "NA", "29", "NA",  "28",   "28",   "29",   "30")
two <- c("06-11-12", "NA",   "01-10-13", "NA",   "NA",   "NA", "NA",   "NA",   "NA",   "06-11-12", "01-10-13", "NA",   "30",  "NA", "29",   "30",   "30",   "30")

df <- matrix(c(one, two), ncol = 18, byrow = TRUE )  
colnames(df) <- c(cnames)
df <- as.data.frame(df)
View(df)

##### What I need
#OUTCOME
one_ <- c("27-09-12_FU24_28", "NA",   "25-07-13_FU36_29", "NA", "NA", "NA", "NA", "NA", "NA", "27-09-12", "25-07-13", "NA", "29",  "NA", "28",   "28",   "29",   "30")
two_ <- c("06-11-12_FU24_30", "NA",   "01-10-13_FU36_30", "NA",   "NA",   "NA", "NA",   "NA",   "NA",   "06-11-12", "01-10-13", "NA", "NA"  ,"30",   "29",   "30",   "30",   "30")

outcome <- matrix(c(one_, two_), ncol = 18, byrow = TRUE )  
colnames(outcome) <- c(cnames)
outcome <- as.data.frame(outcome)
View(outcome)

It is a 2 step proccess,

First I would like to take the dates contained in columns t0:t5, go through columns no_BL:no_48, find their date match, and attach to the dates from t0:t5 columns the last characters of its date-belonging header.

  • For example, "27-09-12" from column t0, is also found in column no_FU24, so I need to attach the last 5 digits of the heather(no_FU24) to the date, looking like this = 27-09-12_FU24.

Second after all the first 6 columns are settled, I have to start an iteration again, but this time attaching the scores to the end of the dates.

  • For example, "27-09-12_FU24" ends with _FU24, which means that from the column MMSE_FU24 I have to copy the score (28) and attach it to the date value, looking like this at the end = "27-09-12_FU24_28"

I was also contemplating the idea of Header=False, and work with it as a regular row.

Please help! and thank you in advance : )


Solution

  • I agree with Ronak's both points here. Though Ronak has proposed a fantastic answer which I can never imagine, yet I worked on your problem and so I propose alternate solution which is nowhere as good as that answer. It works

    df %>% mutate(i = row_number()) %>%
      pivot_longer(cols = contains("t"), names_to = "n", values_to = "v") %>%
      pivot_longer(cols = contains("MM")) %>%
      filter(!is.na(v)) %>%
      rowwise() %>%
      mutate(new = names(.)[which(c_across(1:6) == v)],
             new = str_remove(new, "no_"),
             new2 = case_when(str_detect(name, new) ~ value,
                              TRUE ~ NA_character_),
             v = paste(v, new, new2, sep = "_")) %>%
      filter(!is.na(new2)) %>%
      select(i, n, v) %>%
      pivot_wider(names_from = n, values_from = v) %>%
      left_join(df[-c(1:6)] %>% mutate(i = row_number()), by = "i") %>%
      select(-i)
    
    # A tibble: 2 x 14
      t0    t2    no_BL no_FU6 no_Fu12 no_FU24 no_FU36 no_FU48 MMSE_BL MMSE_FU6 MMSE_FU12 MMSE_FU24 MMSE_FU36
      <chr> <chr> <chr> <chr>  <chr>   <chr>   <chr>   <chr>   <chr>   <chr>    <chr>     <chr>     <chr>    
    1 27-0~ 25-0~ NA    NA     NA      27-09-~ 25-07-~ NA      29      NA       28        28        29       
    2 06-1~ 01-1~ NA    NA     NA      06-11-~ 01-10-~ NA      30      NA       29        30        30       
    # ... with 1 more variable: MMSE_FU48 <chr>
    

    Note Whenver there is sufficient data, the t_cols will automatically find a place in this syntax

    modified df

    > dput(df)
    structure(list(t0 = c("27-09-12", "06-11-12"), t1 = c(NA_character_, 
    NA_character_), t2 = c("25-07-13", "01-10-13"), t3 = c(NA_character_, 
    NA_character_), t4 = c(NA_character_, NA_character_), t5 = c(NA_character_, 
    NA_character_), no_BL = c(NA_character_, NA_character_), no_FU6 = c(NA_character_, 
    NA_character_), no_Fu12 = c(NA_character_, NA_character_), no_FU24 = c("27-09-12", 
    "06-11-12"), no_FU36 = c("25-07-13", "01-10-13"), no_FU48 = c(NA_character_, 
    NA_character_), MMSE_BL = c("29", "30"), MMSE_FU6 = c(NA_character_, 
    NA_character_), MMSE_FU12 = c("28", "29"), MMSE_FU24 = c("28", 
    "30"), MMSE_FU36 = c("29", "30"), MMSE_FU48 = c("30", "30")), class = "data.frame", row.names = c(NA, 
    -2L))