I'm trying to use the youtubecaption
library to download all the transcripts for a playlist and then create a dataframe with the combined results.
I have a list of the video URLs and have tried a for loop that passes them into the get_caption()
function, but I can only get one video's transcript added to the df.
I've tried a few approaches:
vids <- as.list(mydata$videoId)
for (i in 1:length(vids)){
vids2 <- paste("https://www.youtube.com/watch?v=",vids[i],sep="")
test_transcript2 <-
get_caption(
url = vids2,
language = "en",
savexl = FALSE,
openxl = FALSE,
path = getwd())
rbind(test_transcript, test_transcript2)
}
Also using the column of the main dataframe:
captions <- sapply(mydata[, 24], FUN = get_caption)
Is there an efficient way to accomplish this?
In your code, you do rbind(test_transcript, test_transcript2)
but never assign the result, so it is lost forever. Combining that with my comment about not using the rbind(old, newrow)
paradigm, your code might be
vids <- as.list(mydata$videoId)
out <- list()
for (i in 1:length(vids)){
vids2 <- paste("https://www.youtube.com/watch?v=",vids[i],sep="")
test_transcript2 <-
get_caption(
url = vids2,
language = "en",
savexl = FALSE,
openxl = FALSE,
path = getwd())
out <- c(out, list(test_transcript2))
}
alldat <- do.call(rbind, out)
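To see why the accumulate-then-rbind pattern works, here is a toy illustration with made-up data frames standing in for get_caption() results:

```r
# Toy data frames standing in for get_caption() output
out <- list()
for (i in 1:3) {
  piece <- data.frame(videoId = i, text = paste("line", i))
  out[[length(out) + 1]] <- piece   # grow the list, not the data frame
}
alldat <- do.call(rbind, out)       # a single rbind at the end
nrow(alldat)                        # 3
```

Growing a list and rbind-ing once at the end is O(n) overall, whereas rbind(old, newrow) inside the loop copies the whole accumulated frame on every iteration.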
Some other pointers:
- for (i in 1:length(.)) can be a bad practice if this is functionalized; it's better to use for (i in seq_along(vids))
- since we never need the index number itself, we can use for (vid in vids)
- we can do the paste-ing in one shot, generally faster for R, with for (vid in paste0("https://www.youtube.com/watch?v=", vids)), and then url = vid in the call to get_caption

With all that, it might be even simpler to use lapply
for the whole thing:
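The 1:length(.) pitfall is easiest to see with an empty vector, which can happen if this code is wrapped in a function:

```r
vids_empty <- character(0)
1:length(vids_empty)      # 1 0  -- the loop body would run twice
seq_along(vids_empty)     # integer(0) -- the loop body never runs
```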
path <- getwd()
out <- lapply(paste0("https://www.youtube.com/watch?v=", vids),
get_caption, language = "en", savexl = FALSE,
openxl = FALSE, path = path)
do.call(rbind, out)
(NB: untested.)
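One more thought: if some videos in the playlist have no English captions, get_caption() may error and abort the whole run. A sketch (also untested, and assuming get_caption() signals an ordinary R error on failure) that skips those videos instead:

```r
urls <- paste0("https://www.youtube.com/watch?v=", vids)
out <- lapply(urls, function(u) {
  tryCatch(
    get_caption(url = u, language = "en", savexl = FALSE,
                openxl = FALSE, path = getwd()),
    error = function(e) NULL    # skip videos whose captions fail to download
  )
})
# drop the NULLs from failed videos before combining
alldat <- do.call(rbind, Filter(Negate(is.null), out))
```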