I want to split a character column in a data frame by certain separating characters (i.e. white spaces, commas and semicolons). However, I want to exclude certain phrases (in my example I want to exclude "my test") from the split.
I managed to get the ordinary string split, but don't know how to exclude certain phrases.
library(tidyverse)
test <- data.frame(string = c("this is a,test;but I want to exclude my test",
"this is another;of my tests",
"this is my 3rd test"),
stringsAsFactors = FALSE)
test %>%
mutate(new_string = str_split(string, pattern = " |,|;")) %>%
unnest_wider(new_string)
This gives:
# A tibble: 3 x 12
string ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 this is a,test;but I want to exclude my test this is a test but I want to exclude my test
2 this is another;of my tests this is another of my tests NA NA NA NA NA
3 this is my 3rd test this is my 3rd test NA NA NA NA NA NA
However, my desired output would be (excluding "my test"):
# A tibble: 3 x 11
string ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 this is a,test;but I want to exclude my test this is a test but I want to exclude my test
2 this is another;of my tests this is another of my tests NA NA NA NA NA
3 this is my 3rd test this is my 3rd test NA NA NA NA NA
Any ideas? (Side question: how can I name the columns that unnest_wider creates?)
An easy workaround would be to replace the space in the phrase with a _ before splitting and put the space back afterwards:
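test %>%
  # protect the phrase by replacing its space with an underscore
  mutate(string = gsub("my test", "my_test", string),
         new_string = str_split(string, pattern = "[ ,;]")) %>%
  unnest_wider(new_string) %>%
  # restore the original phrase in every column
  mutate(across(everything(), ~ gsub("my_test", "my test", .x)))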
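If you would rather not go through a placeholder, a split pattern with lookarounds might also work. The sketch below is an assumption on my side rather than part of the workaround above: it uses base R strsplit with perl = TRUE and treats a space as a separator unless it is both preceded by "my" and followed by "test". The names x, pattern and parts are only for illustration.

```r
x <- c("this is a,test;but I want to exclude my test",
       "this is another;of my tests",
       "this is my 3rd test")

# Split on , or ; and on any space except the one inside "my test":
# a space counts as a separator if it is NOT preceded by "my"
# or if it is NOT followed by "test".
pattern <- "[,;]|(?<!\\bmy) | (?!test)"

parts <- strsplit(x, pattern, perl = TRUE)
parts[[1]]
```

Like the gsub version, this keeps "my tests" together as well, because the lookahead only checks that the next word starts with "test".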
To give the columns more meaningful names, you can use the additional options from pivot_wider.
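As one possible way to do the naming, unnest_wider itself has a names_sep argument. The sketch below assumes a recent version of tidyr, where unnamed list elements are numbered and pasted onto the list-column name, so the new columns should come out as word_1, word_2 and so on. The column name word and the object named are made up for this example.

```r
library(tidyr)

test <- data.frame(string = c("this is a,test;but I want to exclude my test",
                              "this is another;of my tests",
                              "this is my 3rd test"),
                   stringsAsFactors = FALSE)

# store the split result in a list-column called "word" (hypothetical name)
test$word <- strsplit(test$string, "[ ,;]")

# names_sep joins the list-column name and the element position
named <- unnest_wider(test, word, names_sep = "_")
names(named)
```

Choosing the name of the list-column is then enough to control the prefix of all generated columns.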