Search code examples
rregexdataframetidyrstringr

Match string between two colons in multiline string


I am using stringr to extract the date and the train identifier for each row in a dataframe from the variable status. The string should have the following format:

2008-07-01 : Train_528 :cancelled
2005-11-01 : Train_528 :postponed
2005-06-28 : Train_528 :ontime

I use the following code to extract train id and date:

train_df %>% 
  separate_rows(status, sep = "\\n") %>% 
  mutate(date = as.Date(str_extract(status, "\\d.*\\d")),
         train_id = str_extract(status, "(?<=:)\\w.*(?= :ontime)"))

The code works successfully. However, in some cases, the data is formatted incorrectly where the train ID should have been between the date and the status.

2008-07-01 : :cancelled
2005-11-01 : :postponed
2005-06-28 : :ontime
 :Train_528 :cancelled
 :Train_528 :postponed
 :Train_528 :ontime

The main way to identify this issue is to match two colons with no characters in between:": :". What pattern can I use to pull the train ID without matching the train status string.

I tried using the following code but failed:

train_df %>% 
  separate_rows(status, sep = "\\n") %>%
  mutate(status_duplicated = status) %>% 
  mutate(date = as.Date(str_extract(status, "\\d.*\\d")),
         train_id = if_else(str_detect(status, ":\\s:"), 
                            str_extract(status_duplicated, "(?<=:)\\w.*(?= :ontime)"),
                            str_extract(status, "(?<=:)\\w.*(?= :)")))

Reprex:

train_df <- structure(list(county = 1:3, status = c("2008-07-01 : :cancelled\n2005-11-01 : :postponed\n2005-06-28 : :ontime\n :Train_528 :cancelled\n :Train_528 :postponed\n :Train_528 :ontime", 
"2017-01-13 :Train_222 :ontime\n2016-09-30 :Train_222 :postponed\n2016-09-14 :Train_222 :cancelled\n2014-08-07 :TR 1323 :cancelled\n :TR 1323 :postponed", 
"1985-05-18 :Train_12 :ontime\n1981-12-15 :Train_12 :postponed"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L))

Solution

  • Your expected outcome is implicit. Here is a possible solution:

    library(tidyverse)
    
    train_df %>%
      separate_rows(status, sep = "\n") %>%
      separate(status, c("date", "train_id", "status"), sep = "\\s*:\\s*") %>%
      mutate(across(-county, na_if, "")) %>%
      fill(date, train_id, .direction = "downup")
    
    # A tibble: 13 × 4
       county date       train_id  status   
        <int> <chr>      <chr>     <chr>    
     1      1 2008-07-01 Train_528 cancelled
     2      1 2005-11-01 Train_528 postponed
     3      1 2005-06-28 Train_528 ontime   
     4      1 2005-06-28 Train_528 cancelled
     5      1 2005-06-28 Train_528 postponed
     6      1 2005-06-28 Train_528 ontime   
     7      2 2017-01-13 Train_222 ontime   
     8      2 2016-09-30 Train_222 postponed
     9      2 2016-09-14 Train_222 cancelled
    10      2 2014-08-07 TR 1323   cancelled
    11      2 2014-08-07 TR 1323   postponed
    12      3 1985-05-18 Train_12  ontime   
    13      3 1981-12-15 Train_12  postponed