I have a dataset that reflects responses to practice questions for a small group of people (~30). The dataset is very wide and has data for questions sent every day of the year. For each question that is sent I have data on: 1) the time of the response (ex. "cardiology_1_timestamp), 2) the answer given (ex. "cardiology_1), 3) the correct answer/key (ex. "cardiology_1_key), 4) and an indicator for if the respondent answered the question (ex. "cardiology_1_complete"). The naming convention for all the sets of columns follows this pattern, but the keyword changes. For example "cardiology" could be "nutrition" or "immunology". There are also inconsistent number for each category (cardio has 90, nutrition has 20, etc.), though each question in each category is identified with a sequential integer - but as seen below this integer identifier is not always the column name suffix.
Have
Record_ID | cardiology_1_timestamp | cardiology_1 | cardiology_key_1 | cardiology_1_complete |
---|---|---|---|---|
1 | 2019-07-10 12:45:49 | 1 | 4 | 2 |
2 | 2019-07-11 21:38:52 | 4 | 4 | 2 |
structure(list(Record_ID = 1:2, cardiology_1_timestamp = c("2019-07-10 12:45:49", "2019-07-11 21:38:52"), cardiology_1 = c(1L, 4L), cardiology_key_1 = c(4L, 4L), cardiology_1_complete = c(2L, 2L)), class = "data.frame", row.names = c(NA, -2L))
I would like this data in long format with 1 column each for the characteristics described (timestamp, response given, correct answer, and completed the question) as well as a new column that indicates the question. Going with the present example this would be a new column with a value of "cardiology_1".
Want
Record_ID | Question | Timestamp | Response | Key | Complete |
---|---|---|---|---|---|
1 | cardiology_1 | 2019-07-10 12:45:49 | 1 | 4 | 2 |
1 | cardiology_2 | 2019-07-11 11:20:22 | 2 | 2 | 2 |
2 | cardiology_1 | 2019-07-11 21:38:52 | 4 | 4 | 2 |
2 | cardiology_2 | 2019-07-12 01:20:46 | 2 | 2 | 2 |
structure(list(Record_ID = c(1L, 1L, 2L, 2L), Question = c("cardiology_1", "cardiology_2", "cardiology_1", "cardiology_2"), Timestamp = c("2019-07-10 12:45:49", "2019-07-11 11:20:22", "2019-07-11 21:38:52", "2019-07-12 01:20:46"), Response = c(1L, 2L, 4L, 2L), Key = c(4L, 2L, 4L, 2L), Complete = c(2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA, -4L))
I am relatively new to R (typically a SAS user) but am trying to get this figured out. I have attempted to utilize 'pivot_longer' but am struggling with how to deal with identifying my original columns by their names and extracting the correct portion for the new "Question" column.
library(tidyverse)
# first, we clean the column names. We do this because most of the other columns are of the form area_questionnumber_attribute, and cardiology_1 is not that
colnames(df) <- str_replace_all(colnames(df), pattern = c("key_(\\d+)" = "\\1_Key", "(?<=\\d$)" = "_Answer", "_c" = "_C", "_t" = "_T"))
# we can then simply call pivot_longer, separating the column names
df |>
pivot_longer(
cols = -Record_ID,
names_to = c("Category", "Question", ".value"),
names_sep = "_")
Output:
# A tibble: 2 × 7
Record_ID Category Question Timestamp Answer Key Complete
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 cardiology 1 2019-07-10 12:45:49 1 4 2
2 2 cardiology 1 2019-07-11 21:38:52 4 4 2
Data:
df <- data.frame(
Record_ID = c(1, 2),
cardiology_1_timestamp = c("2019-07-10 12:45:49", "2019-07-11 21:38:52"),
cardiology_1 = c(1, 4),
cardiology_key_1 = c(4, 4),
cardiology_1_complete = c(2, 2)
)