Search code examples
rextractstrsplit

Extracting time from middle of string in R when string changes c


I have a dataset containing training data for multiple athletes on different days/times. One column contains the date and start time of the session. I only wish to keep the start time in this column i.e. i want to remove "2020/01/05" and "UTC". How to I remove everything before and after the time (there are 4 million rows with varying dates/times).

 start.time
1 2020/01/05 21:30:04 UTC 
2 2020/01/05 21:30:04 UTC 
3 2020/01/05 21:30:04 UTC 
4 2020/01/05 21:30:04 UTC 
5 2020/01/05 21:30:04 UTC 
6 2020/01/05 21:30:04 UTC 

Sorry this has probably already been answered somewhere.

Thanks


Solution

  • Multiple ways to do this :

    1) Using regex

    df$time <- sub('.*\\s+(.*) UTC', '\\1', df$start.time)
    df
    #               start.time     time
    #1 2020/01/05 21:30:04 UTC 21:30:04
    #2 2020/01/05 21:30:04 UTC 21:30:04
    #3 2020/01/05 21:30:04 UTC 21:30:04
    #4 2020/01/05 21:30:04 UTC 21:30:04
    #5 2020/01/05 21:30:04 UTC 21:30:04
    #6 2020/01/05 21:30:04 UTC 21:30:04
    

    Here, we capture everything between whitespace and "UTC". \\1 is used as a backreference to capture the extracted value.


    2) Convert to POSIXct and then format

    This can be done in base R :

    format(as.POSIXct(df$start.time, format = "%Y/%m/%d %T"), "%T")
    

    Or using lubridate

    format(lubridate::ymd_hms(df$start.time), "%T")
    

    data

    df <- structure(list(start.time = structure(c(1L, 1L, 1L, 1L, 1L, 1L
    ), .Label = "2020/01/05 21:30:04 UTC", class = "factor")), 
    class = "data.frame", row.names = c(NA,-6L))