r regex extract tidyr data-cleaning

Why can't I split this column using tidyr extract and regex?


First time asking a question on here, so I apologise if I have missed something. I have been looking through existing answers and couldn't find any that address this issue specifically.

I'm trying to split inconsistent strings into two variables using the extract function of the tidyr package.

Reprex of my data with library calls:

library(tidyverse)


date <- c("2017-09-30", "2017-10-02", "2017-10-05", "2017-10-07")
exercise <- c("gym", "run", "gym", "walk")
duration <- c("1h 31m", "45m", "1h 42m", "2h")
raw_data <- tibble(date, exercise, duration)

I would like to extract the numbers in the 'duration' column into two separate columns titled 'hours' and 'minutes', resulting in this:

date <- c("2017-09-30", "2017-10-02", "2017-10-05", "2017-10-07")
exercise <- c("gym", "run", "gym", "walk")
hours <- c("1", NA, "1", "2")
minutes <- c("31", "45", "42", NA)
raw_data <- tibble(date, exercise, hours, minutes)

...I then plan on using a mutate call to store the entire durations in minutes (e.g. c("91", "45", "102", "120")).

My attempt so far:

clean_data <- raw_data %>%
   extract(duration,
           c("hours", "minutes"),
           "([\\d]+(?=h)) \\s ([\\d]+(?=m))")

All this does is produce two columns, correctly named 'hours' and 'minutes' but incorrectly filled with NAs.

I am not familiar with regex, so I assume the problem is in there somewhere. I tried the following, and it does extract the correct values:

str_extract_all(raw_data$duration, "[\\d]+(?=h)")
str_extract_all(raw_data$duration, "[\\d]+(?=m)")

So I would guess my mistake has something to do with either the whitespace (though I've exhausted all the options I can come up with) or the inconsistency of the initial strings. All the existing Stack Overflow answers that I found deal with consistent strings.

Can anyone see where I'm going wrong? Thanks in advance!


Solution

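  • The lookaheads (?=h) and (?=m) are non-consuming: they check that h or m follows but leave it unmatched, so the next part of the pattern has to match starting at the h itself, and \s cannot. The literal spaces on either side of \s also have to match, so the pattern would need three whitespace characters between the two numbers. Use consuming patterns instead, so the regex engine can move past the h and reach the minutes.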
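
To see the difference between consuming and non-consuming patterns, here is a minimal sketch. It uses base R's grepl with perl = TRUE as a stand-in for the engine behind extract, which is an assumption. The name consuming and its pattern are hypothetical, introduced only for illustration.

```r
# Durations from the question.
duration <- c("1h 31m", "45m", "1h 42m", "2h")

# Pattern from the question: lookaheads plus literal spaces around \\s.
original <- "([\\d]+(?=h)) \\s ([\\d]+(?=m))"

# Hypothetical intermediate pattern: h and m are consumed and the
# whitespace between the two parts is optional.
consuming <- "(\\d+)h\\s*(\\d+)m"

# perl = TRUE is needed because base R's default regex engine
# does not support lookaheads.
matches_original  <- grepl(original,  duration, perl = TRUE)
matches_consuming <- grepl(consuming, duration, perl = TRUE)

print(matches_original)   # no string matches
print(matches_consuming)  # only "1h 31m" and "1h 42m" match
```

Consuming the h fixes the strings that contain both parts, but "45m" and "2h" still fail, which is why each part also has to be made optional.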

    You can solve the problem using

    pattern <- "^(?!$)(?:(\\d+)h\\h*)?(?:(\\d+)m)?$"
    clean_data <- raw_data %>%
      extract(duration, c("hours", "minutes"), pattern)
    clean_data
    #   A tibble: 4 x 4
    #   date       exercise hours minutes
    #   <chr>      <chr>    <chr> <chr>  
    # 1 2017-09-30 gym      "1"   "31"   
    # 2 2017-10-02 run      ""    "45"   
    # 3 2017-10-05 gym      "1"   "42"   
    # 4 2017-10-07 walk     "2"   ""   
    

    Pattern details:

    • ^ - string start
    • (?!$) - a negative lookahead that fails the match if the string end comes right at the string start, i.e. it rejects an empty string
    • (?:(\d+)h\h*)? - an optional sequence of
      • (\d+) - Group 1: one or more digits
      • h\h* - h and zero or more horizontal whitespace characters
    • (?:(\d+)m)? - an optional sequence of one or more digits (Group 2) and then m
    • $ - end of string.
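
The question also mentions storing the whole duration in minutes. A minimal sketch of that step, assuming dplyr and tidyr are installed; the column name total_minutes is hypothetical, and remove = FALSE is used only to keep the original duration column for comparison. Converting with as.integer turns an empty string into NA, so the code handles a missing part the same way whether extract returns "" or NA for it.

```r
library(dplyr)
library(tidyr)

raw_data <- tibble(
  date     = c("2017-09-30", "2017-10-02", "2017-10-05", "2017-10-07"),
  exercise = c("gym", "run", "gym", "walk"),
  duration = c("1h 31m", "45m", "1h 42m", "2h")
)

# Pattern from the answer above.
pattern <- "^(?!$)(?:(\\d+)h\\h*)?(?:(\\d+)m)?$"

clean_data <- raw_data %>%
  extract(duration, c("hours", "minutes"), pattern, remove = FALSE) %>%
  mutate(
    # A missing part ("" or NA) becomes NA as an integer, then counts as 0.
    across(c(hours, minutes), ~ replace_na(as.integer(.x), 0L)),
    # total_minutes is a hypothetical column name for the combined duration.
    total_minutes = hours * 60L + minutes
  )

print(clean_data$total_minutes)  # 91 45 102 120
```

Note that a duration that does not match the pattern at all would also end up as 0 minutes here, so it may be worth checking for such rows before replacing the missing values.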