R: Transpose Data from Wide to Long in Dataframe with many column names

I have a dataset that reflects responses to practice questions for a small group of people (~30). The dataset is very wide and has data for questions sent every day of the year. For each question that is sent I have data on: 1) the time of the response (ex. "cardiology_1_timestamp), 2) the answer given (ex. "cardiology_1), 3) the correct answer/key (ex. "cardiology_1_key), 4) and an indicator for if the respondent answered the question (ex. "cardiology_1_complete"). The naming convention for all the sets of columns follows this pattern, but the keyword changes. For example "cardiology" could be "nutrition" or "immunology". There are also inconsistent number for each category (cardio has 90, nutrition has 20, etc.), though each question in each category is identified with a sequential integer - but as seen below this integer identifier is not always the column name suffix.

Have

Record_ID	cardiology_1_timestamp	cardiology_1	cardiology_key_1	cardiology_1_complete
1	2019-07-10 12:45:49	1	4	2
2	2019-07-11 21:38:52	4	4	2

structure(list(Record_ID = 1:2, cardiology_1_timestamp = c("2019-07-10 12:45:49", "2019-07-11 21:38:52"), cardiology_1 = c(1L, 4L), cardiology_key_1 = c(4L, 4L), cardiology_1_complete = c(2L, 2L)), class = "data.frame", row.names = c(NA, -2L))

I would like this data in long format with 1 column each for the characteristics described (timestamp, response given, correct answer, and completed the question) as well as a new column that indicates the question. Going with the present example this would be a new column with a value of "cardiology_1".

Want

Record_ID	Question	Timestamp	Response	Key	Complete
1	cardiology_1	2019-07-10 12:45:49	1	4	2
1	cardiology_2	2019-07-11 11:20:22	2	2	2
2	cardiology_1	2019-07-11 21:38:52	4	4	2
2	cardiology_2	2019-07-12 01:20:46	2	2	2

structure(list(Record_ID = c(1L, 1L, 2L, 2L), Question = c("cardiology_1", "cardiology_2", "cardiology_1", "cardiology_2"), Timestamp = c("2019-07-10 12:45:49", "2019-07-11 11:20:22", "2019-07-11 21:38:52", "2019-07-12 01:20:46"), Response = c(1L, 2L, 4L, 2L), Key = c(4L, 2L, 4L, 2L), Complete = c(2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA, -4L))

I am relatively new to R (typically a SAS user) but am trying to get this figured out. I have attempted to utilize 'pivot_longer' but am struggling with how to deal with identifying my original columns by their names and extracting the correct portion for the new "Question" column.

Solution

library(tidyverse)

# first, we clean the column names. We do this because most of the other columns are of the form area_questionnumber_attribute, and cardiology_1 is not that
colnames(df) <- str_replace_all(colnames(df), pattern = c("key_(\\d+)" = "\\1_Key", "(?<=\\d$)" = "_Answer", "_c" = "_C", "_t" = "_T"))

# we can then simply call pivot_longer, separating the column names
df |>
  pivot_longer(
    cols = -Record_ID,
    names_to = c("Category", "Question", ".value"),
    names_sep = "_")

Output:

# A tibble: 2 × 7
  Record_ID Category   Question Timestamp           Answer   Key Complete
      <dbl> <chr>      <chr>    <chr>                <dbl> <dbl>    <dbl>
1         1 cardiology 1        2019-07-10 12:45:49      1     4        2
2         2 cardiology 1        2019-07-11 21:38:52      4     4        2

Data:

df <- data.frame(
  Record_ID = c(1, 2),
  cardiology_1_timestamp = c("2019-07-10 12:45:49", "2019-07-11 21:38:52"),
  cardiology_1 = c(1, 4),
  cardiology_key_1 = c(4, 4),
  cardiology_1_complete = c(2, 2)
)