r tidytext

Using unnest_tokens() to split a column by a specific character?


I'm working with a column in which each value is a single string of URLs separated by commas:

column_with_urls

["url.a, url.b, url.c"]

["url.d, url.e, url.f"]

I would like to use the tidytext::unnest_tokens() R function to separate these out into one URL per row (although I'm open to other, preferably R-based, solutions). I've read the docs here, but I can't tell whether it's possible or advisable to pass a single character to split on.

My thought is something like unnest_tokens(url, column_with_urls, by = ','). Is there a way to specify that kind of argument and/or a better way to solve this problem?

My desired output is a dataframe with one url per row like this (and all other data for the original rows copied over to each row):

url

url.a

url.b

url.c

...

Thanks in advance.


Solution

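  • unnest_tokens() can split on a regular expression: set token = 'regex' and pass the delimiter as pattern. The example below splits on a comma; the same option works for more complex patterns.

    Two things to be aware of: the result is returned as a tibble, even if the input is a plain data.frame, and unnest_tokens() lowercases the text by default (to_lower = TRUE), so pass to_lower = FALSE if the case of your URLs matters.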

    my_df = data.frame(id=1:2, urls=c("url.a, url.b, url.c",
                                      "url.d, url.e, url.f"))
    tidytext::unnest_tokens(my_df, out, urls, token = 'regex', pattern=",")
    # # A tibble: 6 × 2
    #     id    out
    #   <int>  <chr>
    # 1     1  url.a
    # 2     1  url.b
    # 3     1  url.c
    # 4     2  url.d
    # 5     2  url.e
    # 6     2  url.f
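
Since the question is open to other R-based approaches, here is a minimal base-R sketch of the same split that needs no extra packages. It is not part of the original answer. The names my_df, urls, id and url are illustrative, and the pattern ",\\s*" is an assumption that the comma may be followed by whitespace which should be dropped.

```r
# Example data mirroring the answer above (column names are illustrative).
my_df <- data.frame(
  id = 1:2,
  urls = c("url.a, url.b, url.c", "url.d, url.e, url.f"),
  stringsAsFactors = FALSE
)

# Split each string on a comma plus any whitespace that follows it.
parts <- strsplit(my_df$urls, ",\\s*")

# Repeat each original row once per URL it contains, keeping every
# column except the one that was split.
row_idx <- rep(seq_len(nrow(my_df)), lengths(parts))
out <- my_df[row_idx, setdiff(names(my_df), "urls"), drop = FALSE]

# Attach the individual URLs and reset the row names.
out$url <- unlist(parts)
rownames(out) <- NULL

print(out)
```

Unlike unnest_tokens(), this leaves the case of the text untouched and returns a plain data.frame. If tidyr is available, tidyr::separate_rows(my_df, urls, sep = ",\\s*") should give a similar result in a single call.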