r regex negative-lookbehind

Variable-length negative look-behind for replacing list delimiters and the string end


I want to insert a default index, [0], after every value that lacks one in a string of delimited values, including the last value.

s <- "Dee, DP(Dee, D. P.)[1];Uppala, SM(Uppala, S. M.);Simmons, AJ(Simmons, A. J.);Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)"
gsub("(?<!\\[\\d\\])(;|$)", "\\[0\\]\\2", s, perl=TRUE)

This code inserts the index but drops the delimiters:

"Dee, DP(Dee, D. P.)[1];Uppala, SM(Uppala, S. M.)[0]Simmons, AJ(Simmons, A. J.)[0]Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)[0]"

The code similarly handles the case where the last value already has an index:

s <- "Dee, DP(Dee, D. P.)[1];Uppala, SM(Uppala, S. M.);Simmons, AJ(Simmons, A. J.);Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)[3]"

Giving, this time (delimiters still missing):

"Dee, DP(Dee, D. P.)[1];Uppala, SM(Uppala, S. M.)[0]Simmons, AJ(Simmons, A. J.)[0]Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)[3]"

I need the missing delimiters back, e.g. in the first case:

"Dee, DP(Dee, D. P.)[1];Uppala, SM(Uppala, S. M.)[0];Simmons, AJ(Simmons, A. J.)[0];Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)[0]"

Additionally, I would like the code also to handle indices of more than one digit, i.e. 10, 100, etc. (variable length), e.g.:

s <- "Dee, DP(Dee, D. P.)[10];Uppala, SM(Uppala, S. M.);Simmons, AJ(Simmons, A. J.);Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)"

or

s <- "Dee, DP(Dee, D. P.)[10];Uppala, SM(Uppala, S. M.);Simmons, AJ(Simmons, A. J.);Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)[3]"

Solution

  • In your replacement pattern, \2 refers to a non-existent Group 2. (?<!...) is a negative lookbehind: it is a zero-width assertion, not a capturing group, so it is not numbered and nothing is stored for it. That makes (;|$) Group 1.

    You must use \1, not \2.
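
As a quick check of that fix, here is a minimal sketch that reruns the original call on the first sample string with only the backreference changed from \2 to \1. It still assumes single-digit indices, since the lookbehind \[\d\] is fixed-width; the variable name fixed is introduced here purely for illustration.

```r
s <- "Dee, DP(Dee, D. P.)[1];Uppala, SM(Uppala, S. M.);Simmons, AJ(Simmons, A. J.);Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)"

# (;|$) is the only capturing group, so the replacement must refer to it as \\1.
# The lookbehind (?<!\[\d\]) is not numbered and only checks the preceding text.
fixed <- gsub("(?<!\\[\\d\\])(;|$)", "\\[0\\]\\1", s, perl = TRUE)
print(fixed)
# "Dee, DP(Dee, D. P.)[1];Uppala, SM(Uppala, S. M.)[0];Simmons, AJ(Simmons, A. J.)[0];Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)[0]"
```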

    To work around the variable-width pattern that the lookbehind would need (\[\d+]), you may use a 2-step approach with PCRE patterns, the first of which relies on the (*SKIP)(*F) verbs:

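    # Step 1: add [0] before each ; that does not already follow an index
    s <- gsub("\\[\\d+];(*SKIP)(*F)|(;)", "[0]\\1", s, perl=TRUE)
    # Step 2: append [0] if the string does not end with an index
    sub("(?s)^(?!.*\\[\\d+]$)(.*)", "\\1[0]", s, perl=TRUE)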
    


    Pattern #1 details

    • \[\d+];(*SKIP)(*F) - matches [, 1+ digits, ] and then ;, after which (*SKIP)(*F) makes the match fail and tells the engine to resume the search right after the text consumed so far, so a ; that already follows an index is never replaced
    • | - or
    • (;) - Group 1 (\1 in the replacement pattern): a ; char.
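
To see what the first pattern does on its own, the following sketch applies it to the multi-digit sample from the question. The variable name step1 is an illustrative choice; note that the last value is deliberately left untouched at this stage.

```r
s <- "Dee, DP(Dee, D. P.)[10];Uppala, SM(Uppala, S. M.);Simmons, AJ(Simmons, A. J.);Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)"

# "[10];" and "[2];" are consumed by the first alternative and then discarded
# by (*SKIP)(*F); only the remaining ";" chars reach the (;) alternative.
step1 <- gsub("\\[\\d+];(*SKIP)(*F)|(;)", "[0]\\1", s, perl = TRUE)
print(step1)
# "Dee, DP(Dee, D. P.)[10];Uppala, SM(Uppala, S. M.)[0];Simmons, AJ(Simmons, A. J.)[0];Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)"
```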

    Pattern #2 details

    • (?s) - turns on the dotall mode so that . matches any char, including line breaks, as it does in TRE regex (used by sub without perl=TRUE)
    • ^ - start of string
    • (?!.*\[\d+]$) - a negative lookahead that makes sure there is no [, 1+ digits and ] at the end of the string
    • (.*) - the whole string is captured in Group 1 (\1 in the replacement pattern).
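
Putting the two steps together, one possible way to package them is a small helper. The function name add_missing_index is hypothetical (it is not part of any library), and the sketch assumes that ; only ever occurs as a delimiter between values.

```r
# Hypothetical helper combining the two patterns described above.
add_missing_index <- function(x) {
  # Step 1: insert "[0]" before every ";" not preceded by "[digits]"
  x <- gsub("\\[\\d+];(*SKIP)(*F)|(;)", "[0]\\1", x, perl = TRUE)
  # Step 2: append "[0]" only when the string does not end with "[digits]"
  sub("(?s)^(?!.*\\[\\d+]$)(.*)", "\\1[0]", x, perl = TRUE)
}

s1 <- "Dee, DP(Dee, D. P.)[10];Uppala, SM(Uppala, S. M.);Simmons, AJ(Simmons, A. J.);Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)"
s2 <- "Dee, DP(Dee, D. P.)[10];Uppala, SM(Uppala, S. M.);Simmons, AJ(Simmons, A. J.);Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)[3]"

res <- add_missing_index(c(s1, s2))
print(res)
# [1] "Dee, DP(Dee, D. P.)[10];Uppala, SM(Uppala, S. M.)[0];Simmons, AJ(Simmons, A. J.)[0];Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)[0]"
# [2] "Dee, DP(Dee, D. P.)[10];Uppala, SM(Uppala, S. M.)[0];Simmons, AJ(Simmons, A. J.)[0];Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)[3]"
```

Because gsub and sub are vectorized, the helper accepts a character vector, so both sample strings can be processed in one call.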

    Or, if you can use stringr and you know the index never has more than some number of digits, say, 100, you may leverage the constrained-width lookbehind feature of the ICU regex library that stringr uses:

    stringr::str_replace_all(s, "(?<!\\[\\d{1,100}])(;|$)", "[0]\\1")
    

    Here, (?<!\[\d{1,100}]) matches a location in the string that is not immediately preceded by [, 1 to 100 digits, and then ].
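
For completeness, a hedged usage sketch of the stringr variant on the multi-digit sample follows. It assumes the stringr package is installed and is skipped otherwise. The closing ] is escaped here only to be explicit; the intent is the same as in the pattern above. The variable name out is illustrative.

```r
s <- "Dee, DP(Dee, D. P.)[10];Uppala, SM(Uppala, S. M.);Simmons, AJ(Simmons, A. J.);Kobayashi, S(Kobayashi, S.)[2];Andrae, U(Andrae, U.)"

# Assumption: stringr (and its ICU-based regex engine) is available.
if (requireNamespace("stringr", quietly = TRUE)) {
  # One pass: the bounded lookbehind covers indices of 1 to 100 digits,
  # and (;|$) handles both the delimiters and the end of the string.
  out <- stringr::str_replace_all(s, "(?<!\\[\\d{1,100}\\])(;|$)", "[0]\\1")
  print(out)
}
```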