Search code examples
rdata-cleaningtranspose

R: Transpose Data from Wide to Long in Dataframe with many column names


I have a dataset that reflects responses to practice questions for a small group of people (~30). The dataset is very wide and has data for questions sent every day of the year. For each question that is sent I have data on: 1) the time of the response (ex. "cardiology_1_timestamp), 2) the answer given (ex. "cardiology_1), 3) the correct answer/key (ex. "cardiology_1_key), 4) and an indicator for if the respondent answered the question (ex. "cardiology_1_complete"). The naming convention for all the sets of columns follows this pattern, but the keyword changes. For example "cardiology" could be "nutrition" or "immunology". There are also inconsistent number for each category (cardio has 90, nutrition has 20, etc.), though each question in each category is identified with a sequential integer - but as seen below this integer identifier is not always the column name suffix.

Have

Record_ID cardiology_1_timestamp cardiology_1 cardiology_key_1 cardiology_1_complete
1 2019-07-10 12:45:49 1 4 2
2 2019-07-11 21:38:52 4 4 2
structure(list(Record_ID = 1:2, cardiology_1_timestamp = c("2019-07-10 12:45:49", "2019-07-11 21:38:52"), cardiology_1 = c(1L, 4L), cardiology_key_1 = c(4L, 4L), cardiology_1_complete = c(2L, 2L)), class = "data.frame", row.names = c(NA, -2L)) 

I would like this data in long format with 1 column each for the characteristics described (timestamp, response given, correct answer, and completed the question) as well as a new column that indicates the question. Going with the present example this would be a new column with a value of "cardiology_1".

Want

Record_ID Question Timestamp Response Key Complete
1 cardiology_1 2019-07-10 12:45:49 1 4 2
1 cardiology_2 2019-07-11 11:20:22 2 2 2
2 cardiology_1 2019-07-11 21:38:52 4 4 2
2 cardiology_2 2019-07-12 01:20:46 2 2 2
structure(list(Record_ID = c(1L, 1L, 2L, 2L), Question = c("cardiology_1", "cardiology_2", "cardiology_1", "cardiology_2"), Timestamp = c("2019-07-10 12:45:49", "2019-07-11 11:20:22", "2019-07-11 21:38:52", "2019-07-12 01:20:46"), Response = c(1L, 2L, 4L, 2L), Key = c(4L, 2L, 4L, 2L), Complete = c(2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA, -4L)) 

I am relatively new to R (typically a SAS user) but am trying to get this figured out. I have attempted to utilize 'pivot_longer' but am struggling with how to deal with identifying my original columns by their names and extracting the correct portion for the new "Question" column.


Solution

  • library(tidyverse)
    
    # first, we clean the column names. We do this because most of the other columns are of the form area_questionnumber_attribute, and cardiology_1 is not that
    colnames(df) <- str_replace_all(colnames(df), pattern = c("key_(\\d+)" = "\\1_Key", "(?<=\\d$)" = "_Answer", "_c" = "_C", "_t" = "_T"))
    
    # we can then simply call pivot_longer, separating the column names
    df |>
      pivot_longer(
        cols = -Record_ID,
        names_to = c("Category", "Question", ".value"),
        names_sep = "_")
    

    Output:

    # A tibble: 2 × 7
      Record_ID Category   Question Timestamp           Answer   Key Complete
          <dbl> <chr>      <chr>    <chr>                <dbl> <dbl>    <dbl>
    1         1 cardiology 1        2019-07-10 12:45:49      1     4        2
    2         2 cardiology 1        2019-07-11 21:38:52      4     4        2
    

    Data:

    df <- data.frame(
      Record_ID = c(1, 2),
      cardiology_1_timestamp = c("2019-07-10 12:45:49", "2019-07-11 21:38:52"),
      cardiology_1 = c(1, 4),
      cardiology_key_1 = c(4, 4),
      cardiology_1_complete = c(2, 2)
    )