Tags: r, youtube, tidyverse, youtube-data-api

Use an R Loop to Bulk Download YouTube Transcripts with youtubecaption


I'm trying to use the youtubecaption package to download all the transcripts for a playlist and then combine the results into a single data frame.

I have a list of the video URLs and have tried to create a for loop to pass them into the get_caption() function. I can only get one video's transcripts added to the df.

I've tried a few approaches:

vids <- as.list(mydata$videoId)

for (i in 1:length(vids)){
  vids2 <- paste("https://www.youtube.com/watch?v=",vids[i],sep="")
  test_transcript2 <-
    get_caption(
     url = vids2,
     language = "en",
     savexl = FALSE,
     openxl = FALSE,
     path = getwd())
  rbind(test_transcript, test_transcript2)
 }

Also using the column of the main dataframe:

captions <- sapply(mydata[,24], FUN = get_captions)

Is there an efficient way to accomplish this?


Solution

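  • In your code, you call rbind(test_transcript, test_transcript2) but never assign the result, so it is discarded at the end of each iteration. It is also better to avoid the rbind(old, newrow) pattern inside a loop, because it copies the whole data frame on every pass; collect the pieces in a list and bind them once at the end. With both changes, your code might be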

    vids <- as.list(mydata$videoId)
    
    out <- list()
    for (i in 1:length(vids)){
      vids2 <- paste("https://www.youtube.com/watch?v=",vids[i],sep="")
      test_transcript2 <-
        get_caption(
         url = vids2,
         language = "en",
         savexl = FALSE,
         openxl = FALSE,
         path = getwd())
      out <- c(out, list(test_transcript2))
    }
    alldat <- do.call(rbind, out)
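
To see why the unassigned rbind() loses data, and how the list-then-bind pattern behaves, here is a small offline sketch. fake_caption() is a hypothetical stand-in for get_caption(): it makes no network call, and its columns are invented for illustration. The video IDs are placeholders too. It preallocates the list rather than growing it with c(), which is a minor variation on the code above.

```r
# Hypothetical stand-in for get_caption(): returns a small data frame
# without touching the network. Column names are invented for illustration.
fake_caption <- function(url) {
  data.frame(url = url, segment = 1:2, text = c("hello", "world"),
             stringsAsFactors = FALSE)
}

ids <- c("id_one", "id_two", "id_three")   # placeholder video IDs
urls <- paste0("https://www.youtube.com/watch?v=", ids)

# Pattern from the question: the result of rbind() is never assigned,
# so `lost` still holds only the first transcript after the loop.
lost <- fake_caption(urls[1])
for (u in urls[-1]) {
  rbind(lost, fake_caption(u))   # computed, then thrown away
}

# Pattern from the answer: store each piece in a list, bind once at the end.
pieces <- vector("list", length(urls))
for (i in seq_along(urls)) {
  pieces[[i]] <- fake_caption(urls[i])
}
kept <- do.call(rbind, pieces)

nrow(lost)   # 2: only the first video
nrow(kept)   # 6: two rows for each of the three videos
```

Swapping fake_caption() for the real get_caption() call (with the arguments shown above) should give the same shape of result, assuming every video returns a data frame with the same columns.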
    

    Some other pointers:

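    • for (i in 1:length(.)) is risky once this is wrapped in a function: if the input is empty, 1:length(.) evaluates to c(1, 0) and the loop still runs twice. It's better to use for (i in seq_along(vids)), which does not iterate at all on empty input.

    • Since we never need the index number itself, we can simply use for (vid in vids).

    • We can do the pasting in one shot, which is generally faster in R because paste0 is vectorised: use for (vid in paste0("https://www.youtube.com/watch?v=", vids)), and then url = vid in the call to get_caption.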

    • with all that, it might be even simpler to use lapply for the whole thing:

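      path <- getwd()
      # Fetch every transcript; extra arguments are passed on to get_caption
      out <- lapply(paste0("https://www.youtube.com/watch?v=", vids),
                    get_caption, language = "en", savexl = FALSE,
                    openxl = FALSE, path = path)
      # Bind once at the end, and assign the result this time
      alldat <- do.call(rbind, out)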
      

    (NB: untested.)
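
The first and third pointers can be checked without any network access. The snippet below is only an illustration in base R; the IDs are placeholders, not real videos.

```r
# 1:length(x) misbehaves on empty input: it counts down from 1 to 0.
empty <- character(0)
bad_idx  <- 1:length(empty)     # c(1, 0): a loop over this runs twice
good_idx <- seq_along(empty)    # integer(0): a loop over this never runs

# paste0() is vectorised, so all URLs can be built in a single call.
ids <- c("id_one", "id_two")    # placeholder video IDs
urls <- paste0("https://www.youtube.com/watch?v=", ids)

length(bad_idx)    # 2
length(good_idx)   # 0
urls
```

The resulting urls vector can be passed directly to lapply() or iterated with for (vid in urls), as in the suggestions above.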