r regex stringr

Using regex pattern to extract sentence with keyword


I have a function to find matches in a data frame (ignore the t2 line, it's "turned off"):

library(stringr)

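find.all.matches <- function(search.col, pattern) {
  # One matrix of matches per element of search.col
  captured <- str_match_all(search.col, pattern = pattern)
  trimmed <- lapply(captured, str_trim)
  #t2 <- lapply(trimmed, function(x) gsub("[^a-z]","",x)) ##turned off
  # lapply rather than sapply, so the result is always a list with one entry per row
  uniq <- lapply(trimmed, unique)
  # Collapse each row's matches into a single comma-separated string
  found.col <- unlist(lapply(uniq, toString))
  return(found.col)
}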

I am running this code on a specific column of a large dataset (~20,000 rows). The column contains abstracts from scientific journals.

I use the following code to add the strings matched by a pattern as a new column in the data frame:

testing2 <- find.all.matches(search.col = all_data$abstract_l,
                             pattern = pattern)

all_data$testing_mu_m <- testing2

Here is my current pattern:

pattern = '\\d+(?:[.,]\\d+)*\\s*mu m\\b|ba\\b'

It matches every number that precedes mu m, as well as the word ba, in the following example abstract:

a protocol for in vitro propagation of adult lavandula dentata plants has been achieved. cultures were established by placing nodal segments on murashige and skoog medium containing ba, kin, and naa. highest shoot multiplication rates were obtained when explants grown in the presence of 5.0 mu m ba or 20 mu m kin were transferred to medium with 8.8 mu m ba and 15% coconut milk. multiplication efficiency through subcultures was significantly affected by the cytokinin concentration in the initial culture medium. subculture reduced drastically the final number of shoots produced on nodal segments isolated from shoots grown in the presence of 2.0 mu m ba or 40.0 mu m kin. shoots were easily rooted on murashige and skoog hormone-free medium with macronutrients at half-strength. plants were successfully transplanted into soil. 

Is there a way to pull out every whole sentence that contains ba? I would like a pattern that I could plug into the find.all.matches function. Desired output (three sentences, separated here by AND): cultures were established by placing nodal segments on murashige and skoog medium containing ba, kin, and naa AND highest shoot multiplication rates were obtained when explants grown in the presence of 5.0 mu m ba or 20 mu m kin were transferred to medium with 8.8 mu m ba and 15% coconut milk AND subculture reduced drastically the final number of shoots produced on nodal segments isolated from shoots grown in the presence of 2.0 mu m ba or 40.0 mu m kin.


Solution

  • You could use this regex to match entire sentences containing ba:

    (?<=^|\. )(?:(?!\.(?: |$)).)*?\bba\b.*?\.(?= |$)
    

    It matches:

    • (?<=^|\. ) : start of a sentence (a position preceded by either the beginning of the string or a period and a space)
    • (?:(?!\.(?: |$)).)*? : a minimal number of characters, none of which are a . followed by a space or end-of-string (a tempered greedy token)
    • \bba\b : the word ba
    • .*?\.(?= |$) : a minimal number of characters followed by . and either a space or end-of-string.
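To see why the tempered greedy token is worth the extra syntax, here is a small sketch comparing it with a simpler variant that just forbids every period. The sample text and the `naive` pattern are my own illustrations, not part of the question, and the backslashes are doubled because the patterns are written as R strings.

```r
library(stringr)

# Hypothetical two-sentence sample; note the decimal point in "5.0".
txt <- "explants were grown with 5.0 mu m ba. plants were moved to soil."

# Naive variant (illustration only): no period allowed inside a sentence.
naive <- "[^.]*\\bba\\b[^.]*\\."

# Pattern from this answer: only a period followed by a space or
# end-of-string counts as a sentence boundary.
tempered <- "(?<=^|\\. )(?:(?!\\.(?: |$)).)*?\\bba\\b.*?\\.(?= |$)"

str_extract(txt, naive)     # "0 mu m ba."
str_extract(txt, tempered)  # "explants were grown with 5.0 mu m ba."
```

The naive variant starts its match after the decimal point, so it cuts the sentence in half, while the tempered version treats "5.0" as ordinary text.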

    Regex demo on regex101

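    Note that to use this in R, you will need to double all the backslashes, because the pattern is written as an R string literal (stringr handles the lookarounds as-is; base R functions such as gregexpr need perl = TRUE), i.e.: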

    (?<=^|\\. )(?:(?!\\.(?: |$)).)*?\\bba\\b.*?\\.(?= |$)
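As a rough sketch of how the doubled-backslash pattern could be plugged into the question's workflow, the example below applies it to a tiny data frame. The two abstracts, the `ba_sentences` column name and the `sentence_pattern` variable are made up for illustration, and `find.all.matches` is restated here (with `lapply` throughout) only so that the snippet runs on its own.

```r
library(stringr)

# Restatement of the question's helper, so this snippet is self-contained.
find.all.matches <- function(search.col, pattern) {
  captured <- str_match_all(search.col, pattern = pattern)
  trimmed <- lapply(captured, str_trim)
  uniq <- lapply(trimmed, unique)
  unlist(lapply(uniq, toString))
}

# The sentence pattern, with backslashes doubled for an R string.
sentence_pattern <- "(?<=^|\\. )(?:(?!\\.(?: |$)).)*?\\bba\\b.*?\\.(?= |$)"

# Hypothetical stand-in for all_data: one abstract with ba, one without.
all_data <- data.frame(
  abstract_l = c(
    "medium contained ba and kin. shoots were rooted. 8.8 mu m ba was used.",
    "plants were transplanted into soil."
  ),
  stringsAsFactors = FALSE
)

all_data$ba_sentences <- find.all.matches(all_data$abstract_l, sentence_pattern)
all_data$ba_sentences
```

Rows without a matching sentence end up as empty strings. Because `toString` joins matches with a comma and a space, sentences that themselves contain commas are hard to split apart again afterwards; if that matters, a different separator passed to `paste(..., collapse = )` may be a better fit.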