I'm working with video transcript data. The data was automatically exported with line breaks in the middle of sentences. I'd like to combine the lines of each spoken passage into a single row. The data is formatted as such:
data$transcript<-as.data.frame(c("00:00:03.990 --> 00:00:05.270",
"<v Bill>I'm here to take some notes. I've",
"heard this will be interesting.</v>",
"00:00:05.770 --> 00:00:07.370",
"<v Charlie>I believe you'll be correct",
"about that, Bill.</v>",
"00:00:10.810 --> 00:00:11.170",
"<v Bill>Awesome.</v>"))
Intended output:
intendedData$transcript<-as.data.frame(c("00:00:03.990 --> 00:00:05.270",
"<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>",
"00:00:05.770 --> 00:00:07.370",
"<v Charlie>I believe you'll be correct about that, Bill.</v>",
"00:00:10.810 --> 00:00:11.170",
"<v Bill>Awesome.</v>"))
I've tried conditional statements for rows that start with <v and end with </v>, but that didn't work. Any ideas will be greatly appreciated. Thank you!
You could paste the transcript together as a single long string, then use regular expressions to extract the timestamps and speech. Personally, I would want to keep these as distinct variables, but if you want you can interleave them together to give the desired output:
transcript <- c("00:00:03.990 --> 00:00:05.270",
"<v Bill>I'm here to take some notes. I've",
"heard this will be interesting.</v>",
"00:00:05.770 --> 00:00:07.370",
"<v Charlie>I believe you'll be correct",
"about that, Bill.</v>",
"00:00:10.810 --> 00:00:11.170",
"<v Bill>Awesome.</v>")
transcript <- paste(transcript, collapse = " ")
timestamp_regex <- "\\d+:\\d+:\\d+\\.\\d+ --> \\d+:\\d+:\\d+\\.\\d+"
speech_regex <- "<v .*?</v>"
timestamps <- stringr::str_extract_all(transcript, timestamp_regex)[[1]]
speech <- stringr::str_extract_all(transcript, speech_regex)[[1]]
vctrs::vec_interleave(timestamps, speech)
#> [1] "00:00:03.990 --> 00:00:05.270"
#> [2] "<v Bill>I'm here to take some notes. I've heard this will be interesting.</v>"
#> [3] "00:00:05.770 --> 00:00:07.370"
#> [4] "<v Charlie>I believe you'll be correct about that, Bill.</v>"
#> [5] "00:00:10.810 --> 00:00:11.170"
#> [6] "<v Bill>Awesome.</v>"
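If you do want the timestamps and speech as distinct variables, here is a minimal base R sketch of one way to do it. It assumes every cue is a timestamp line followed by at least one speech line, and that each speech block is wrapped in a single `<v Name>...</v>` tag. The names `is_timestamp`, `cue_id`, and `cues` are just illustrative.

```r
transcript <- c("00:00:03.990 --> 00:00:05.270",
                "<v Bill>I'm here to take some notes. I've",
                "heard this will be interesting.</v>",
                "00:00:05.770 --> 00:00:07.370",
                "<v Charlie>I believe you'll be correct",
                "about that, Bill.</v>",
                "00:00:10.810 --> 00:00:11.170",
                "<v Bill>Awesome.</v>")

# Every timestamp line starts a new cue, so a running count of
# timestamp lines gives each line the id of the cue it belongs to
is_timestamp <- grepl("^\\d+:\\d+:\\d+\\.\\d+ --> ", transcript)
cue_id <- cumsum(is_timestamp)

# Join the speech lines of each cue with a single space
speech <- vapply(
  split(transcript[!is_timestamp], cue_id[!is_timestamp]),
  paste, character(1), collapse = " "
)
speech <- unname(speech)

# One row per cue (assumption: one <v Name>...</v> block per cue)
cues <- data.frame(
  timestamp = transcript[is_timestamp],
  speaker   = sub("^<v ([^>]+)>.*$", "\\1", speech),
  text      = sub("^<v [^>]+>(.*)</v>$", "\\1", speech),
  stringsAsFactors = FALSE
)
cues
```

This avoids the extra packages and keeps one row per cue, which tends to be easier to filter or join on later. If a cue can contain a timestamp with no speech after it, the column lengths will not match and the `data.frame()` call will fail, so you would need to handle that case first.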