Search code examples
rstringrbibtex

R command to extract text between two strings containing curly parentheses


I am trying to use the R str_match function from the stringr library to extract the title in bibliographical entries like the following. Indeed, I need to extract the text between the
"title={" and the "}," strings.

a2
[1] "@article{2020, title={Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR‐421 and E‐cadherin}, volume={9}, ISSN={2045-7634}, url={http://dx.doi.org/10.1002/cam4.3002}, DOI={10.1002/cam4.3002}, number={11}, journal={Cancer Medicine}, publisher={Wiley}, author={Ji, Yefeng and Feng, Guanying and Hou, Yunwen and Yu, Yang and Wang, Ruixia and Yuan, Hua}, year={2020}, month={Apr}, pages={3954–3963} }"

I have used approaches like the following, but I get an error message:

str_match(a2, "(?s)title={\\s*(.*?)\\s*},.")

Error in stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) :
Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=(?s)title={\s*(.*?)\s*},.)

I guess the problem is with the matching of the curly parentheses, but I couldn't make any progress. Any pointer would be greatly appreciated.


Solution

  • Use the following regex.

    a2 <- "@article{2020, title={Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin}, volume={9}, ISSN={2045-7634}, url={http://dx.doi.org/10.1002/cam4.3002}, DOI={10.1002/cam4.3002}, number={11}, journal={Cancer Medicine}, publisher={Wiley}, author={Ji, Yefeng and Feng, Guanying and Hou, Yunwen and Yu, Yang and Wang, Ruixia and Yuan, Hua}, year={2020}, month={Apr}, pages={3954–3963} }"
    
    sub("^.*title=\\{([^{}]+)\\}.*$", "\\1", a2)
    #> [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin"
    

    Created on 2022-03-19 by the reprex package (v2.0.1)


    Edit

    Alternative stringr way.

    stringr::str_match(a2, "^.*title=\\{([^{}]+)\\}.*$")[,2]
    #> [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin"
    

    Created on 2022-03-19 by the reprex package (v2.0.1)