r, dplyr, tidytext

Using unnest_tokens() to split a column selectively (no split if comma inside a bracket)


I would be most grateful for advice. I would like to split my strings at each comma, but text inside brackets that contains a comma must be preserved (i.e. not split). My data has 4 possible combinations of whitespace and commas:

1. no space after the comma inside the parentheses: (c,d)
2. a space after the comma inside the parentheses: (x, y)
3. a space after the comma outside the parentheses: url.d, url.e
4. no space after the comma outside the parentheses: url.d,url.e

In my example below, url.b (c,d) needs to stay together, as does url.h (x, y). In the output below, rows 8 and 9 need to appear together as a single row, and row 11 needs to be split.

my_df = data.frame(id=1:4, urls=c("url.a, url.b (c,d), url.c",
                                  "url.d, url.e, url.f",
                                  "url.g, url.h (x, y), url.i",
                                  "url.d,url.e, url.f"))


tidytext::unnest_tokens(my_df, out, urls, token = 'regex', pattern=",\\s+")

   id         out
1   1       url.a
2   1 url.b (c,d)
3   1       url.c
4   2       url.d
5   2       url.e
6   2       url.f
7   3       url.g
8   3    url.h (x
9   3          y)
10  3       url.i
11  4 url.d,url.e
12  4       url.f

Thank you!


Solution

  • (2nd attempt after test data update)

    Here's one strategy to try out:

    • use a placeholder character for commas in parentheses (let's pick |)
    • split on ",\\s*", which matches every comma together with any optional trailing whitespace
    • restore the commas in place of the placeholder
    library(dplyr)
    library(stringr)
    library(tidytext)
    
    my_df = data.frame(id=1:4, urls=c("url.a, url.b (c,d), url.c",
                                      "url.d, url.e, url.f",
                                      "url.g, url.h (x, y), url.i",
                                      "url.d,url.e, url.f"))
    
    # before applying unnest_tokens, replace commas inside parentheses
    # with a placeholder, `|`
    my_df %>% 
      mutate(urls = str_replace_all(urls, 
                                    "\\(([^)]*)\\)", 
                                    \(match) str_replace_all(match, fixed(","), "|"))) %>% 
      unnest_tokens(out, urls, token = 'regex', pattern=",\\s*") %>% 
      # restore commas
      mutate(out = str_replace_all(out, fixed("|"), ","))
    #>    id          out
    #> 1   1        url.a
    #> 2   1  url.b (c,d)
    #> 3   1        url.c
    #> 4   2        url.d
    #> 5   2        url.e
    #> 6   2        url.f
    #> 7   3        url.g
    #> 8   3 url.h (x, y)
    #> 9   3        url.i
    #> 10  4        url.d
    #> 11  4        url.e
    #> 12  4        url.f
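
    The difference between the original ",\\s+" pattern and ",\\s*" can be seen on the last test string alone. This is a minimal sketch that uses base R's strsplit() instead of unnest_tokens(), purely to keep the comparison free of package dependencies; the variable names are illustrative:

    ```r
    x <- "url.d,url.e, url.f"

    # ",\\s+" requires at least one whitespace character after the comma,
    # so the comma in "url.d,url.e" is not treated as a separator
    plus_split <- strsplit(x, ",\\s+", perl = TRUE)[[1]]
    plus_split
    #> [1] "url.d,url.e" "url.f"

    # ",\\s*" also matches a comma that has no whitespace after it
    star_split <- strsplit(x, ",\\s*", perl = TRUE)[[1]]
    star_split
    #> [1] "url.d" "url.e" "url.f"
    ```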
    
    

    A closer look at the str_replace_all(..., \(x) do_something(x)) call: the pattern "\\(([^)]*)\\)" finds substrings that are enclosed in parentheses:

    str_view("url.a, url.b (c,d, foo, bar), url.c", "\\(([^)]*)\\)")
    #> [1] │ url.a, url.b <(c,d, foo, bar)>, url.c
    

    But instead of a replacement string, we'll use a replacement function that modifies each match, replacing , with the placeholder | (assuming | does not appear anywhere in the urls column):

    # \(match) ... is shorthand for an anonymous (lambda) function, available since R 4.1
    anon_function <- \(match) str_replace_all(match, fixed(","), "|")
    anon_function("c,d, foo")
    #> [1] "c|d| foo"
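
    Since the strategy assumes that | never occurs in the urls column, it may be worth checking that before running the pipeline. A possible check, sketched here with base R's grepl() on a copy of the test data (the names urls, placeholder and placeholder_in_use are only illustrative):

    ```r
    urls <- c("url.a, url.b (c,d), url.c",
              "url.d, url.e, url.f",
              "url.g, url.h (x, y), url.i",
              "url.d,url.e, url.f")
    placeholder <- "|"

    # fixed = TRUE makes grepl() look for a literal "|" rather than
    # interpreting it as regex alternation
    placeholder_in_use <- any(grepl(placeholder, urls, fixed = TRUE))
    placeholder_in_use
    #> [1] FALSE

    # stop early if the placeholder would clash with real data
    stopifnot(!placeholder_in_use)
    ```

    If the check fails for real data, any other character that is absent from the column could serve as the placeholder instead.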
    

    Putting those two pieces together to replace every comma between ( and ) with the placeholder:

    str_replace_all(my_df$urls, "\\(([^)]*)\\)", \(match) str_replace_all(match, fixed(","), "|"))
    #> [1] "url.a, url.b (c|d), url.c"  "url.d, url.e, url.f"       
    #> [3] "url.g, url.h (x| y), url.i" "url.d,url.e, url.f"
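
    For comparison, a different technique (not the placeholder approach used above) would be to split only on commas that are not followed by a closing parenthesis before the next bracket, using a negative lookahead. The sketch below uses base R's strsplit() with perl = TRUE and assumes parentheses are balanced and never nested; the names urls, pattern and parts are illustrative:

    ```r
    urls <- c("url.a, url.b (c,d), url.c",
              "url.d, url.e, url.f",
              "url.g, url.h (x, y), url.i",
              "url.d,url.e, url.f")

    # match a comma plus optional whitespace, but only when the text that
    # follows does NOT reach a ")" before meeting another "(" or ")";
    # commas inside "(...)" therefore never match
    pattern <- ",\\s*(?![^()]*\\))"

    parts <- strsplit(urls, pattern, perl = TRUE)
    parts[[1]]
    #> [1] "url.a"       "url.b (c,d)" "url.c"
    parts[[3]]
    #> [1] "url.g"        "url.h (x, y)" "url.i"
    ```

    The placeholder approach is arguably easier to read and debug; the lookahead version avoids the replace-and-restore round trip at the cost of a denser pattern.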
    

    Created on 2023-11-22 with reprex v2.0.2