
How to extract multi-word terms based on their position in the text?


I want to extract certain terms that appear between a year in parentheses and the following comma in a given text. Although the term Mining appears both before and after (2020) in the text, I need only the latter occurrence, the one found between (2020) and the comma. The same applies to the term Computer Science in the following text.

library(stringr)
text <- "This is text Mining exercise (2020) Mining, p. 628508; Computer Science text analysis (1998) Computer Science, p.345-355; Introduction to data mining (2015) J. Data Science, pp. 31-33"
comp <- c("Mining", "Computer Science", "J. Data Science")
pattern <- str_c(comp,collapse ="|")
data <- str_extract_all(text, pattern)

The last line extracts every occurrence regardless of position, so data[[1]] contains:

[1] "Mining" "Mining" "Computer Science" "Computer Science" "J. Data Science" 

The output that I'm looking for is:

[1] "Mining" "Computer Science" "J. Data Science" 

Note: the position of those words matters. Any help is highly appreciated!


Solution

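  • If we need to extract the term between the ) that closes the four-digit year and the following comma, wrap the pattern in regex lookarounds: a lookbehind for the parenthesized year plus one whitespace character, and a lookahead for the comma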

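    library(stringr)
    # `text` and `pattern` are the objects defined in the question.
    # (?<=\(\d{4}\)\s) requires "(yyyy) " right before the match; (?=,) requires a comma right after it.
    str_extract_all(text, str_c("(?<=\\(\\d{4}\\)\\s)(", pattern, ")(?=,)"))[[1]]
    #[1] "Mining"           "Computer Science" "J. Data Science"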
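If position alone identifies the terms, the list of known terms may not be needed. The sketch below is a variation on the lookaround idea, not the original answer's method. It assumes every wanted term sits directly after a four-digit year in parentheses and contains no comma itself. The names `m`, `titles`, and `titles_base` are made up for illustration.

```r
library(stringr)

text <- "This is text Mining exercise (2020) Mining, p. 628508; Computer Science text analysis (1998) Computer Science, p.345-355; Introduction to data mining (2015) J. Data Science, pp. 31-33"
comp <- c("Mining", "Computer Science", "J. Data Science")

# Capture everything after "(yyyy) " up to, but not including, the next comma.
# Column 1 of the result holds the full match, column 2 the capture group.
m <- str_match_all(text, "\\(\\d{4}\\)\\s+([^,]+),")[[1]]
titles <- m[, 2]
titles

# Optionally keep only the terms that are also in the known list.
titles[titles %in% comp]

# Base R equivalent using a PCRE lookbehind/lookahead (assumes exactly
# one whitespace character after the closing parenthesis).
titles_base <- regmatches(
  text,
  gregexpr("(?<=\\(\\d{4}\\)\\s)[^,]+(?=,)", text, perl = TRUE)
)[[1]]
titles_base
```

Matching on position has a side benefit: the unescaped `.` in "J. Data Science" is no longer part of the regex, so it cannot match an arbitrary character by accident.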