
split a string but keep together certain substrings


I want to split a character column in a data frame on certain separator characters (whitespace, commas and semicolons). However, I want to keep certain phrases intact instead of splitting them (in my example, "my test").

I managed to get the ordinary string split, but don't know how to exclude certain phrases.

library(tidyverse)

test <- data.frame(string = c("this is a,test;but I want to exclude my test",
                              "this is another;of my tests",
                              "this is my 3rd test"),
                   stringsAsFactors = FALSE)

test %>%
  mutate(new_string = str_split(string, pattern = " |,|;")) %>%
  unnest_wider(new_string)

This gives:

# A tibble: 3 x 12
  string                                       ...1  ...2  ...3    ...4  ...5  ...6  ...7  ...8  ...9    ...10 ...11
  <chr>                                        <chr> <chr> <chr>   <chr> <chr> <chr> <chr> <chr> <chr>   <chr> <chr>
1 this is a,test;but I want to exclude my test this  is    a       test  but   I     want  to    exclude my    test 
2 this is another;of my tests                  this  is    another of    my    tests NA    NA    NA      NA    NA   
3 this is my 3rd test                          this  is    my      3rd   test  NA    NA    NA    NA      NA    NA

However, my desired output would be (excluding "my test"):

# A tibble: 3 x 11
  string                                       ...1  ...2  ...3    ...4  ...5      ...6  ...7  ...8  ...9    ...10
  <chr>                                        <chr> <chr> <chr>   <chr> <chr>     <chr> <chr> <chr> <chr>   <chr>
1 this is a,test;but I want to exclude my test this  is    a       test  but       I     want  to    exclude my test 
2 this is another;of my tests                  this  is    another of    my tests  NA    NA    NA    NA      NA   
3 this is my 3rd test                          this  is    my      3rd   test      NA    NA    NA    NA      NA

Any idea? (side question: any idea how I can name the columns in the unnest_wider thing?)


Solution

  • An easy workaround is to replace the space inside the protected phrase with an underscore before splitting, and to restore it afterwards:

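    test %>%
      # Protect the phrase by joining its words with an underscore
      mutate(string = gsub("my test", "my_test", string),
             new_string = str_split(string, pattern = "[ ,;]")) %>%
      unnest_wider(new_string) %>%
      # Restore the original phrase in every column
      mutate_all(~ gsub("my_test", "my test", .x))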
    

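    To give the columns more meaningful names, you can use the naming arguments of unnest_wider itself. For example, names_sep = "_" prefixes each new column with the name of the list column (in recent tidyr versions this gives new_string_1, new_string_2, and so on).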
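
A different approach, not part of the answer above, is to skip the placeholder round trip and extract tokens instead of splitting. The sketch below uses only base R; the pattern and the `word_` column prefix are illustrative assumptions. Because the protected phrase comes first in the alternation, it is matched as one token before the generic "run of non-separator characters" branch can break it up.

```r
# Sample data copied from the question.
test <- data.frame(
  string = c("this is a,test;but I want to exclude my test",
             "this is another;of my tests",
             "this is my 3rd test"),
  stringsAsFactors = FALSE
)

# Alternation order matters: try the protected phrase (plus any trailing
# non-separator characters, e.g. "my tests") first, then any other token.
token_pattern <- "my test[^ ,;]*|[^ ,;]+"

# One character vector of tokens per row.
tokens <- regmatches(test$string,
                     gregexpr(token_pattern, test$string, perl = TRUE))

# Pad each row with NA up to the longest row, then name the columns.
n_max <- max(lengths(tokens))
wide <- t(vapply(tokens, function(x) x[seq_len(n_max)], character(n_max)))
colnames(wide) <- paste0("word_", seq_len(n_max))

result <- cbind(test, as.data.frame(wide, stringsAsFactors = FALSE))
print(result)
```

This avoids the case where the data already contains the placeholder text ("my_test"), and it gives you explicit column names at the same time. To protect several phrases, list each one as a separate alternative before the final `[^ ,;]+` branch.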