Tags: r, regex, pcre, backreference

R: regex to capture all instances after a given character


Given the string ab cd ; ef gh ij, how do I remove all spaces after the first space that follows the ;, i.e. get ab cd ; efghij? I tried using \K, but it only removes the first run of spaces after the semicolon.

test = 'ab cd  ; ef  gh ij'
gsub('(?<=; )[^ ]+\\K +', '', test, perl = TRUE)
# "ab cd  ; efgh ij"

Solution

  • 1) gsubfn Using gsubfn from the gsubfn package, here is a one-liner that needs only simple regular expressions. It passes the capture group to the given function (expressed in formula notation) and replaces the match with the function's output.

    library(gsubfn)
    
    gsubfn("; (.*)", ~ paste(";", gsub(" ", "", x)), test)
    ## [1] "ab cd  ; efghij"
    

    2) gsub This uses a pattern that matches a space which is not immediately preceded by a semicolon and is not followed anywhere in the remainder of the string by a semicolon and space.

    gsub("(?<!;) (?!.*; )", "", test, perl = TRUE)
    ## [1] "ab cd  ; efghij"
    

    3) regexpr/substring This finds the position of the semicolon, uses substring to split the string in two at that point, removes the spaces from the second part with gsub, and finally pastes the two parts back together.

    ix <- regexpr(";", test)
    paste(substring(test, 1, ix), gsub(" ", "", substring(test, ix + 2)))
    ## [1] "ab cd  ; efghij"
    

    4) read.table This is similar to (3) but uses read.table to break the input into two fields.

    with(read.table(text = test, sep = ";", as.is = TRUE), paste0(V1, "; ", gsub(" ", "", V2)))
    ## [1] "ab cd  ; efghij"
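
The regexpr/substring idea in (3) can also be wrapped in a small function so that it works on a whole character vector and leaves strings without a semicolon unchanged. The following is only a minimal sketch in base R: the function name strip_spaces_after_semicolon is hypothetical (it is not part of any package), and it assumes, as in the question, that a single space directly follows the first semicolon and should be kept.

```r
# Hypothetical helper (not from any package): a vectorized variant of (3).
# Assumes a single space directly follows the first ";" and must be kept.
strip_spaces_after_semicolon <- function(x) {
  # Position of the first ";" in each string (-1 if absent, NA for NA input)
  ix <- as.integer(regexpr(";", x, fixed = TRUE))
  has <- !is.na(ix) & ix > 0

  # Everything up to and including the character right after the ";"
  head <- substring(x[has], 1, ix[has] + 1)
  # The remainder, from which every space is removed
  tail <- substring(x[has], ix[has] + 2)

  x[has] <- paste0(head, gsub(" ", "", tail, fixed = TRUE))
  x
}

strip_spaces_after_semicolon(c("ab cd  ; ef  gh ij", "no semicolon here"))
## [1] "ab cd  ; efghij"   "no semicolon here"
```

Strings that contain no semicolon, as well as NA values, are returned as they are, which the one-liners above do not guarantee.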