I would be most grateful for advice. I would like to split my strings at each comma, but a comma inside parentheses must not cause a split (the text within the parentheses needs to stay together). There are 4 possibilities in my data regarding whitespace and commas.
1. no space after the comma within the parentheses: (c,d)
2. a space after the comma within the parentheses: (x, y)
3. a space after the comma outside the parentheses: url.d, url.e
4. no space after the comma outside the parentheses: url.d,url.e
In my example below url.b (c,d) needs to appear together, as does url.h (x, y). In the output below, rows 8 and 9 need to appear together as a single row, and row 11 needs to be split.
my_df = data.frame(id=1:4, urls=c("url.a, url.b (c,d), url.c",
"url.d, url.e, url.f",
"url.g, url.h (x, y), url.i",
"url.d,url.e, url.f"))
tidytext::unnest_tokens(my_df, out, urls, token = 'regex', pattern=",\\s+")
id out
1 1 url.a
2 1 url.b (c,d)
3 1 url.c
4 2 url.d
5 2 url.e
6 2 url.f
7 3 url.g
8 3 url.h (x
9 3 y)
10 3 url.i
11 4 url.d,url.e
12 4 url.f
Thank you!
(2nd attempt after test data update)
Here's one strategy to try out: replace the commas inside parentheses with a placeholder, |, and then use ",\\s*" for splitting; it will match all commas with optional trailing whitespace.
library(dplyr)
library(stringr)
library(tidytext)
my_df = data.frame(id=1:4, urls=c("url.a, url.b (c,d), url.c",
"url.d, url.e, url.f",
"url.g, url.h (x, y), url.i",
"url.d,url.e, url.f"))
# before applying unnest_tokens, replace commas in parentheses
# with a placeholder, `|`
my_df %>%
  mutate(urls = str_replace_all(urls,
                                "\\(([^)]*)\\)",
                                \(match) str_replace_all(match, fixed(","), "|"))) %>%
  unnest_tokens(out, urls, token = 'regex', pattern=",\\s*") %>%
  # restore commas
  mutate(out = str_replace_all(out, fixed("|"), ","))
#> id out
#> 1 1 url.a
#> 2 1 url.b (c,d)
#> 3 1 url.c
#> 4 2 url.d
#> 5 2 url.e
#> 6 2 url.f
#> 7 3 url.g
#> 8 3 url.h (x, y)
#> 9 3 url.i
#> 10 4 url.d
#> 11 4 url.e
#> 12 4 url.f
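The pipeline above only works if the placeholder never occurs in the data, so it may be worth checking that first. A minimal sketch in base R; `placeholder` and `has_placeholder` are names made up for this illustration:

```r
my_df <- data.frame(id = 1:4, urls = c("url.a, url.b (c,d), url.c",
                                       "url.d, url.e, url.f",
                                       "url.g, url.h (x, y), url.i",
                                       "url.d,url.e, url.f"))

# hypothetical helper variable: the character we intend to use as placeholder
placeholder <- "|"

# fixed = TRUE treats "|" as a literal character, not as regex alternation
has_placeholder <- any(grepl(placeholder, my_df$urls, fixed = TRUE))

# stop early if the placeholder is already present in the column
stopifnot(!has_placeholder)
```

If the check fails, any other character or string that is absent from the column could serve as the placeholder instead.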
A closer look at str_replace_all(..., \(x) do_something(x))
The pattern "\\(([^)]*)\\)" is used to find substrings that are enclosed in parentheses:
str_view("url.a, url.b (c,d, foo, bar), url.c", "\\(([^)]*)\\)")
#> [1] │ url.a, url.b <(c,d, foo, bar)>, url.c
But instead of a replacement string we'll use a replacement function that modifies our match and replaces , with a placeholder | (assuming | is not used anywhere in the urls column):
# \(match) ... notation is a shorthand for anonymous / lambda function
anon_function <- \(match) str_replace_all(match, fixed(","), "|")
anon_function("c,d, foo")
#> [1] "c|d| foo"
Putting those 2 pieces together to eliminate all commas between ( and ):
str_replace_all(my_df$urls, "\\(([^)]*)\\)", \(match) str_replace_all(match, fixed(","), "|"))
#> [1] "url.a, url.b (c|d), url.c" "url.d, url.e, url.f"
#> [3] "url.g, url.h (x| y), url.i" "url.d,url.e, url.f"
Created on 2023-11-22 with reprex v2.0.2
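As an alternative to the placeholder round trip, a single split pattern with a negative lookahead might also do the job. This is only a sketch in base R and not part of the reprex above; it assumes parentheses are balanced and never nested, and `split_pattern`, `parts` and `long_df` are names made up for illustration:

```r
my_df <- data.frame(id = 1:4, urls = c("url.a, url.b (c,d), url.c",
                                       "url.d, url.e, url.f",
                                       "url.g, url.h (x, y), url.i",
                                       "url.d,url.e, url.f"))

# a comma plus optional whitespace, matched only when the next parenthesis
# character further ahead is NOT a closing ")", i.e. we are not inside "(...)"
split_pattern <- ",\\s*(?![^()]*\\))"

# lookaheads need the PCRE engine, hence perl = TRUE
parts <- strsplit(my_df$urls, split_pattern, perl = TRUE)

# one row per piece, repeating each id as often as its string was split
long_df <- data.frame(id = rep(my_df$id, lengths(parts)),
                      out = unlist(parts))
```

Unlike unnest_tokens(), which lowercases its output by default, strsplit() leaves the case of the pieces untouched.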