Search code examples
rregexdataframetidyrdelimiter

Separating with multiple alternative delimiters in tidyr


I have a column of inconsistently formatted strings that I am trying to split using tidyr::separate_wider_delim(). These either have the pattern "text[space]text", or "text[space]text[comma][space]text" (see example below). In the former, I want to split at the space; in the latter I want to split at the comma/space combination (which is always going to be the second occurrence of a space).

Either of these situations is easy using separate_wider_delim(). However, I can't figure out how to do both at the same time. Doing them in sequence also doesn't work, because the space is consistent between them.

library(tidyr)

df <- data.frame(txt = c("December 2005", "January 2008", "January 3, 2001", "December 18, 2004"))
  
df %>%
  separate_wider_delim(txt, delim = ", ",
                       names = c("month", "year"),
                       too_few = "align_start")

What I'm trying to get to is:

month         year
December      2005
January       2008
January 3     2001
December 18   2004

I can use regex to identify the location of the second space:

str_locate_all(txt, "\\s")[[1]][2, 2]

But I can't figure out how to use this to define the delimiter. I could probably just grab the last four characters pretty easily, but my real data is messier than this, so I would greatly prefer something that relies on the delimiter (if only one space, then the space; if multiple spaces, then the second space/the space following the comma).


Solution

  • Staying with delim-based splitting you can use a regex to define the delimeter as:

    • ,?: an optional comma
    • \s+: followed by the space
    • (?= ... ): before
      • \d{4}: the 4-digit number (year)
      • $: that comes before the end of string

    See regex demo at regex101

    
    library(tidyr)
    library(stringr)
    
    df <- data.frame(
      txt = c(
        "December 2005", 
        "January 2008", 
        "January 3, 2001", 
        "December 18, 2004"
        )
      )
    
    df %>%
      separate_wider_delim(
        txt, 
        delim = stringr::regex(",?\\s+(?=\\d{4}$)"),
        names = c("month", "year"),
        too_few = "align_start"
        )
    
    # # A tibble: 4 × 2
    # month       year 
    # <chr>       <chr>
    # 1 December    2005 
    # 2 January     2008 
    # 3 January 3   2001 
    # 4 December 18 2004