An example data frame:
df <- data.frame(ID = c(1, 1, 2, 2, 2, 2, 2),
Name = c("Alice", "Alice", "Bob", "Bob", "Bob", "Bob", "Bob"),
Age = c(25, 25, 30, 30, 30, 30, 30),
LINE = c(1, 2, 1, 2, 1, 2, 3),
NOTE_TEXT = c("This is the fir", "st note",
"This is the seco", "nd note",
"This is ", "the th", "ird note"))
Essentially my complete "Note" due to character constraints from the source pull of data is split into substrings of "NOTE_TEXT" across multiple rows. The substrings that belong to the same "Note" are consecutively listed by another variable called "LINE" which can be anywhere from 1 to 65 (the vast majority are within 4 lines). I want to consolidate the "NOTE_TEXT"s belonging to the same "Note" into one line, and also make a new variable to denote the uniqueness of each "Note" belonging to the same set of ID, Name, Age, variables.
The resultant dataframe would look like this:
data.frame(ID = c(1, 2, 2),
Name = c("Alice", "Bob", "Bob"),
Age = c(25, 30, 30),
Note = c(1, 1, 2),
NOTE_TEXT = c("This is the first note",
"This is the second note",
"This is the third note"))
I imagine I need to use some sort of for loop to loop through "LINE" for each set of unique variables, but I'm not sure where to start. Thanks for any help!
df |>
mutate(note_num = cumsum(LINE == 1), .by = c(ID, Name, Age)) |>
summarize(NOTE_TEXT = paste(NOTE_TEXT, collapse = ""),
.by = c(ID, Name, Age, note_num))
Result
ID Name Age note_num NOTE_TEXT
1 1 Alice 25 1 This is the first note
2 2 Bob 30 1 This is the second note
3 2 Bob 30 2 This is the third note