
Extracting in-text citations (character strings) from text in R


I'm trying to write a function that takes pasted text and returns a list of the in-text citations used in it. This is what I currently have:

pull_cites <- function(text) {
  gsub("[\\(\\)]", "", regmatches(text, gregexpr("\\(.*?\\)", text))[[1]])
}

pull_cites("This is a test. I only want to select the (cites) in parenthesis. I do not want it to return words in 
    parenthesis that do not have years attached, such as abbreviations (abbr). For example, citing (Smith 2010) is 
    something I would want to be returned. I would also want multiple citations returned separately such as 
    (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned as Cooper 2015, and not just 2015.")

In this example, it returns:

[1] "cites"                              "abbr"                               "Smith 2010"                        
[4] "Smith 2010; Jones 2001; Brown 2020" "2015"

But I would want it to return something like:

[1] "Smith 2010"
[2] "Smith 2010"                
[3] "Jones 2001"
[4] "Brown 2020"
[5] "Cooper 2015"

Any ideas on how to make this function more specific? I am using R. Thanks!


Solution

  • You can use a regex with two capturing groups together with stringr:

    x <- "This is a test. I only want to select the (cites) in parenthesis. I do not want it to return words in parenthesis that do not have years attached, such as abbreviations (abbr). For example, citing (Smith 2010) is something I would want to be returned. I would also want multiple citations returned separately such as (Smith 2010; Jones 2001; Brown 2020). I would also want Cooper (2015) returned as Cooper 2015, and not just 2015."
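    # Group 1: optional capitalized name(s); Group 2: parenthesized text ending in a 4-digit year
    rx <- "(?:\\b(\\p{Lu}\\w*(?:\\s+\\p{Lu}\\w*)*))?\\s*\\(([^()]*\\d{4})\\)"
    library(stringr)
    res <- str_match_all(x, rx)  # list of matrices: full match, Group 1, Group 2
    # Prepend the name only when the parentheses contain nothing but digits
    result <- lapply(res, function(z) {
      ifelse(!is.na(z[, 2]) & str_detect(z[, 3], "^\\d+$"),
             paste(trimws(z[, 2]), trimws(z[, 3])), z[, 3])
    })
    unlist(sapply(result, function(z) strsplit(paste(z, collapse = ";"), "\\s*;\\s*")))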
    ## -> [1] "Smith 2010"  "Smith 2010"  "Jones 2001"  "Brown 2020"  "Cooper 2015"
    


    The regex matches:

    • (?:\b(\p{Lu}\w*(?:\s+\p{Lu}\w*)*))? - an optional sequence of
      • \b - a word boundary
      • (\p{Lu}\w*(?:\s+\p{Lu}\w*)*) - Group 1: an uppercase letter followed by zero or more word chars, and then zero or more sequences of one or more whitespace chars and an uppercase letter followed by zero or more word chars
    • \s* - zero or more whitespace chars
    • \( - a ( char
    • ([^()]*\d{4}) - Group 2: zero or more chars other than ( and ), and then four digits
    • \) - a ) char.
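
    To make the two groups concrete, here is a minimal sketch that runs the same pattern on a shortened sentence. The string s and the variable m are made up for illustration; the pattern is the one explained above, and stringr is assumed to be installed.

```r
library(stringr)

# Same pattern as in the answer
rx <- "(?:\\b(\\p{Lu}\\w*(?:\\s+\\p{Lu}\\w*)*))?\\s*\\(([^()]*\\d{4})\\)"

# Hypothetical shortened input: one full citation, one "Name (year)", one abbreviation
s <- "citing (Smith 2010) and Cooper (2015), but not (abbr)"

# str_match_all() returns a list with one matrix per input string;
# column 1 is the whole match, column 2 is Group 1, column 3 is Group 2
m <- str_match_all(s, rx)[[1]]

m[, 2]  # Group 1: NA for "(Smith 2010)", "Cooper" for "Cooper (2015)"
m[, 3]  # Group 2: "Smith 2010" and "2015"; "(abbr)" has no year, so it is not matched
```

    Group 1 is NA whenever no capitalized word sits directly before the opening parenthesis, which is what the later is.na() check relies on.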

    The str_match_all(x, rx) function finds all matches and keeps the captured substrings: in each result matrix, column 1 is the whole match, column 2 is Group 1, and column 3 is Group 2. The Group 1 and Group 2 values are then concatenated if Group 1 is not NA and Group 2 consists only of digits; otherwise, the Group 2 value is used as is. Finally, the items in the result variable are joined with a ; char and split on ; (surrounded by zero or more whitespace chars).
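
    Since the question asks for a function, the steps described above could be wrapped up roughly as follows. This is a sketch under a few assumptions rather than a drop-in solution: it reuses the name pull_cites from the question, expects a single string as input, and the local names m, author, inner and year_only are illustrative.

```r
library(stringr)

# Sketch: return one element per in-text citation found in a single string
pull_cites <- function(text) {
  rx <- "(?:\\b(\\p{Lu}\\w*(?:\\s+\\p{Lu}\\w*)*))?\\s*\\(([^()]*\\d{4})\\)"
  m <- str_match_all(text, rx)[[1]]    # columns: full match, Group 1, Group 2
  if (nrow(m) == 0) return(character(0))

  author <- m[, 2]                     # capitalized name before "(", or NA
  inner  <- m[, 3]                     # text inside the parentheses

  # "Cooper (2015)" case: a name precedes parentheses that hold only digits
  year_only <- !is.na(author) & str_detect(inner, "^\\d+$")
  cites <- ifelse(year_only, paste(author, inner), inner)

  # "(Smith 2010; Jones 2001)" case: split on ";" and drop stray whitespace
  trimws(unlist(strsplit(cites, "\\s*;\\s*")))
}

pull_cites("citing (Smith 2010), several (Smith 2010; Jones 2001; Brown 2020), and Cooper (2015), but not (abbr)")
## -> [1] "Smith 2010"  "Smith 2010"  "Jones 2001"  "Brown 2020"  "Cooper 2015"
```

    Note that a capitalized word directly before a full citation, as in "See (Smith 2010)", is captured in Group 1 but ignored, because the parenthesized part is not all digits.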