
How to count years in the text in R?


I want to count the years that appear between parentheses in the following text, stored in txt.

library(stringr)
txt <- "Text Mining exercise (2020) Mining, p. 628508; Computer Science text analysis (1998) Computer Science, p.345-355; Introduction to data mining (2015) J. Data Science, pp. 31-33"

lengths(strsplit(txt,"\\(\\d{4}\\)")) gives me 4, which is wrong, since the text contains only three years. Any help, please?


Solution

  • I think you are looking for stringr::str_count():

    str_count(txt, "\\([0-9]{4}\\)")
    [1] 3
    

    To count only four-digit numbers in parentheses whose first digit is 1 or 2 and whose second digit is 0 or 9:

    # Note: inside a character class, | is a literal pipe, so use [09] rather than [0|9]
    str_count(txt, "\\([12][09][0-9]{2}\\)")
    

    Strictly starting with either 19 or 20:

    str_count(txt, "\\(19[0-9]{2}\\)|\\(20[0-9]{2}\\)")
    # In R >= 4.0, a raw string avoids the doubled backslashes
    str_count(txt, r"(\(19[0-9]{2}\)|\(20[0-9]{2}\))")
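
As a side note on why the original attempt returned 4: strsplit() returns the pieces of text between the matches, so three matches produce four pieces. The sketch below is a base R alternative (not the method used in the answer above) that counts the matches directly and also pulls out the years themselves. The names `pattern`, `pieces`, `n_years`, and `years` are illustrative only.

```r
txt <- "Text Mining exercise (2020) Mining, p. 628508; Computer Science text analysis (1998) Computer Science, p.345-355; Introduction to data mining (2015) J. Data Science, pp. 31-33"

# Four-digit year starting with 19 or 20, wrapped in parentheses
pattern <- "\\((?:19|20)[0-9]{2}\\)"

# strsplit() yields the text around the matches: 3 matches give 4 pieces
pieces <- strsplit(txt, pattern, perl = TRUE)[[1]]
length(pieces)
# [1] 4

# gregexpr() returns the start position of each match (-1 if there is none)
matches <- gregexpr(pattern, txt, perl = TRUE)
n_years <- sum(matches[[1]] > 0)
n_years
# [1] 3

# regmatches() extracts the matched text; strip the parentheses to get the years
years <- as.integer(gsub("[()]", "", regmatches(txt, matches)[[1]]))
years
# [1] 2020 1998 2015
```

Counting matches rather than split pieces also avoids an edge case: strsplit() drops a trailing empty piece, so the "pieces minus one" shortcut would be off whenever the text ends with a match.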