Tags: r, regex, stringr

Extracting a substring with regex in R returns a list with a matrix


I want to extract substrings from a string in R. I tested the pattern with regex101, and it does extract the substring I want, but it also matches at every other character position in my string. In R I get a list containing a matrix with an entry for each of those attempted matches, and since only a few of them match anything, most of the entries are empty strings. I would like only the matches as a result, not a list or a matrix.

I have a bibliography and want to extract every reference to a volume, an issue or a number, including the numeral that follows (Roman or Arabic). So it should match Volume, Issue and Number followed by 1 as well as by I or II. Sometimes there are several of those in one string (Volume 3, Issue 2). Can anyone tell me why it checks every single character?

This is my code so far:

library(stringr)

string <- 'ABC  (2013c), Something Something Text (Volume II): Some more blabla, the usual, end of string'

pattern <- "[V|v]ol(?:ume)?\\s*(\\d+|(V?I{0,3}X?L?C{0,3}D?M?))|(?:\\s+(Issue|No|Nr|nr|no|Number)\\s*(\\d+|V?I{0,3}X?L?C{0,3}D?M?))?"
matches <- str_match_all(string, pattern)

Solution

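  • The main issue is that the part of your pattern after | is wrapped in an optional non-capturing group, (?:...)?, so that alternative can match an empty string. Wherever the first alternative fails, the second one succeeds with an empty match, which is why you get a result at nearly every position in the string. Even after the other typos are fixed (for example, [V|v] is a character class that also matches a literal |, and should be [Vv]), that problem still needs to be resolved.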
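
    The empty-match effect can be reproduced in isolation. The sketch below uses a deliberately tiny toy pattern (an assumption made for illustration, not the pattern from the question) in which the second alternative is entirely optional. It also shows the difference in return shape: str_match_all() gives a list of matrices with one column for the full match plus one per capture group, while str_extract_all() gives a list of character vectors holding only the matched text.

    ```r
    library(stringr)

    # Toy pattern, made up for illustration: the second alternative is
    # fully optional, so it can succeed with an empty match wherever
    # the first alternative (a) fails.
    toy <- "(a)|(?:b)?"
    x <- "xa"

    # str_match_all(): a list with one character matrix per input string.
    # Column 1 is the whole match, column 2 is capture group 1.
    m <- str_match_all(x, toy)[[1]]

    # str_extract_all(): a list with one character vector per input string.
    # Alongside the real match "a" it contains empty strings.
    e <- str_extract_all(x, toy)[[1]]

    print(m)
    print(e)
    ```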

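    The number-matching part is the same on both sides of the | operator, so you can merge both alternatives into one. Using str_extract_all instead of str_match_all then returns only the matched text rather than a matrix of capture groups: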

    library(stringr)

    string <- 'ABC  (2013c), Something Something Text (Volume II): Some more blabla, the usual, end of string'

    # All keywords merged into one group, followed by an Arabic or Roman number
    rx <- "\\b(?:[Vv]ol(?:ume)?|Issue|No|Nr|nr|no|Number)\\s*(?:\\d+|V?I{0,3}X?L?C{0,3}D?M?)"
    str_extract_all(string, rx)
    ## => [[1]]
    ##    [1] "Volume II"
    

    See the R demo online

    The pattern will look like

    \b(?:[Vv]ol(?:ume)?|Issue|No|Nr|nr|no|Number)\s*(?:\d+|V?I{0,3}X?L?C{0,3}D?M?)
    

    See the regex demo. Details:

    • \b - a word boundary
    • (?:[Vv]ol(?:ume)?|Issue|No|Nr|nr|no|Number) - vol, Vol, volume, Volume, Issue, No, Nr, nr, no or Number
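    • \s* - zero or more whitespace characters
    • (?:\d+|V?I{0,3}X?L?C{0,3}D?M?) - one or more digits, or a Roman-numeral-like sequence: an optional V, then zero to three Is, an optional X, an optional L, zero to three Cs, an optional D and an optional M. Because every part of this sequence is optional, it can also match an empty string, so a keyword with no number after it is still matched.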
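
    For entries that contain several references, such as the "Volume 3, Issue 2" case mentioned in the question, the same pattern can be applied to a whole character vector. The following is a minimal sketch; the two bibliography strings are hypothetical and made up for illustration.

    ```r
    library(stringr)

    rx <- "\\b(?:[Vv]ol(?:ume)?|Issue|No|Nr|nr|no|Number)\\s*(?:\\d+|V?I{0,3}X?L?C{0,3}D?M?)"

    # Hypothetical bibliography entries, not taken from the question.
    refs <- c(
      "Some Title (Volume 3, Issue 2): Subtitle",
      "Another Title, Vol VII, Number 12"
    )

    # One list element per input string, each holding only the matched text.
    found <- str_extract_all(refs, rx)
    print(found)
    ## found[[1]] is c("Volume 3", "Issue 2")
    ## found[[2]] is c("Vol VII", "Number 12")

    # Flatten to a plain character vector if the grouping per entry is not needed.
    all_found <- unlist(found)
    ```

    Keeping the list form preserves which entry each reference came from; unlist() is only appropriate when that link does not matter.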