R: Formatting youtube video duration into proper time (seconds)


I have a vector (a data column) in R that contains YouTube playback durations as character strings.

x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S", "PT31S", "PT2M")

How do I strip the PT prefix and get the total duration in seconds?

The resulting vector should be c(3608, 595, 837, 65, 1812, 4865, 408, 31, 120).

Example: PT1H21M5S is 4865 seconds (1H = 1*3600, 21M = 21*60, 5S = 5*1, and 3600 + 1260 + 5 = 4865).


Solution

  • I wrote a small sapply loop with regex substitutions. Each gsub call deletes everything except the seconds, minutes, or hours component, which is then converted to seconds and summed.

    x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S")
    x2 <- sapply(x, function(i){
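      # Seconds: strip the leading part through M or H, and the trailing S
      t <- as.numeric(gsub("^(.*)M|^(.*)H|S$", "", i))
      # Minutes: strip PT (or everything through H) and everything from M on
      if(grepl("M", i)) t <- t + as.numeric(gsub("^(.*)PT|^(.*)H|M(.*)$", "",i)) * 60
      # Hours: strip PT and everything from H on
      if(grepl("H", i)) t <- t + as.numeric(gsub("^(.*)PT|H(.*)$", "",i)) * 3600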
      t
    })
    x2
       PT1H8S   PT9M55S  PT13M57S    PT1M5S  PT30M12S PT1H21M5S   PT6M48S 
         3608       595       837        65      1812      4865       408 
    

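    EDIT: Per request, a version that also handles durations with only a seconds or only a minutes component (PT31S, PT2M):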

    x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S", "PT31S", "PT2M")
    x2 <- sapply(x, function(i){
      t <- 0
      if(grepl("S", i)) t <- t + as.numeric(gsub("^(.*)PT|^(.*)M|^(.*)H|S$", "", i))
      if(grepl("M", i)) t <- t + as.numeric(gsub("^(.*)PT|^(.*)H|M(.*)$", "",i)) * 60
      if(grepl("H", i)) t <- t + as.numeric(gsub("^(.*)PT|H(.*)$", "",i)) * 3600
      t
    })
    x2
       PT1H8S   PT9M55S  PT13M57S    PT1M5S  PT30M12S PT1H21M5S   PT6M48S     PT31S      PT2M 
         3608       595       837        65      1812      4865       408        31       120 
    

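    This should cover every combination of hours, minutes, and seconds in the example data. If other formats turn up, the trick is to alter the regex. ^ anchors the beginning of the string and $ anchors the end. (.*) matches any run of characters. So ^(.*)H matches everything from the beginning of the string through H, and gsub replaces that match with an empty string.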
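
As a possible alternative to deleting the unwanted parts, the number in front of each unit letter can be extracted directly. The following is only a sketch, not the method used above: `duration_to_seconds` and its inner `part` helper are hypothetical names, and it assumes durations contain only H, M, and S components after the PT prefix (no days or fractional seconds).

```r
# Hypothetical helper: convert strings like "PT1H21M5S" to total seconds.
# Assumes only H, M and S components are present.
duration_to_seconds <- function(x) {
  # Return the number preceding `unit` for each element, or 0 if it is absent.
  part <- function(unit) {
    m <- regmatches(x, regexec(paste0("([0-9]+)", unit), x))
    # Each element of m is c(full match, captured digits), or empty on no match.
    vapply(m, function(g) if (length(g) == 2) as.numeric(g[2]) else 0, numeric(1))
  }
  part("H") * 3600 + part("M") * 60 + part("S")
}

x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S",
       "PT1H21M5S", "PT6M48S", "PT31S", "PT2M")
duration_to_seconds(x)
# 3608 595 837 65 1812 4865 408 31 120
```

Because every component defaults to 0 when missing, no `grepl` checks are needed, and the function works on the whole vector at once instead of one element at a time.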