R - removing substring in column of strings based on pattern and condition


I have a column of strings in a data frame where I would like to replace the values to include only the substring before the first " (", i.e., before the first space/open bracket pair. Not all of the strings contain brackets, and I want those to be left as they are.

Example data:

col1 <- c(1, 2, 3, 4)
col2 <- c("a b (ABC DE)", "bcd", "cd ef (CE)", "bcd")
df <- data.frame(col1, col2)
df

Output:

  col1         col2
1    1 a b (ABC DE)
2    2          bcd
3    3   cd ef (CE)
4    4          bcd

The output I'm looking for would be something like this:

col1 <- c(1, 2, 3, 4)
col2 <- c("a b", "bcd", "cd ef", "bcd")
df <- data.frame(col1, col2)
df

Output:

  col1  col2
1    1   a b
2    2   bcd
3    3 cd ef
4    4   bcd

The actual data frame is 40000+ rows with the strings taking many possible values, so it can't be done manually like in the example. I'm not confident at all working with regex/patterns, but accept this may be the most straightforward way to do this.


Solution

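  • A possible solution, based on stringr. The pattern "\\s*\\(.*\\)\\s*" matches a bracketed group together with any whitespace around it, and str_remove_all deletes every such match. Strings without brackets contain no match, so they are left as they are: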

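    library(tidyverse)  # attaches dplyr (mutate, %>%) and stringr (str_remove_all)
    
    # Remove any bracketed group plus the whitespace around it.
    # This prints the result; assign it back to df to keep the change.
    df %>% 
      mutate(col2 = str_remove_all(col2, "\\s*\\(.*\\)\\s*"))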
    
    #>   col1  col2
    #> 1    1   a b
    #> 2    2   bcd
    #> 3    3 cd ef
    #> 4    4   bcd
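
If the goal is strictly "keep everything before the first space/open bracket pair", a base R variant may be a closer fit. The following is only a sketch, assuming the example data from the question and assuming that any text after the first " (" should be dropped as well. It uses sub(), which replaces only the first (leftmost) match, so no extra packages are needed.

```r
# Rebuild the example data from the question.
df <- data.frame(
  col1 = c(1, 2, 3, 4),
  col2 = c("a b (ABC DE)", "bcd", "cd ef (CE)", "bcd")
)

# " \\(.*$" matches the first space + "(" pair and everything after it,
# up to the end of the string. Strings without " (" have no match and
# are returned unchanged.
df$col2 <- sub(" \\(.*$", "", df$col2)

df
```

The two approaches can differ when text follows the closing bracket: for a hypothetical value such as "x (A) y", this version returns "x", while the stringr pattern above removes only " (A) " and returns "xy".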