
Using negative lookahead in ls(pattern = "") in R


Suppose I have the following objects in memory:

ab
ab_b
ab_pm
ab_pn
c1_ab_b

and I only want to keep ab_pm and ab_pn.

I tried to use negative lookahead in ls() to list ab, ab_b and c1_ab_b for removal:

rm(list = ls(pattern = "ab_?(?!p)"))

However, I got the error:

Error in grep(pattern, all.names, value = TRUE) :
  invalid regular expression 'ab_?(?!p)', reason 'Invalid regexp'

I tried my regex at regex101.com, and found it matched all five object names, which suggested my regex was not "invalid", although it did not do what I wanted. My questions are:

  1. Does ls() in R support negative lookahead? I know grep() needs perl = TRUE to support it, but do not see a similar argument in the ls() help documentation.
  2. How can I correctly select the three objects I want to remove?

Solution

  • Your PCRE regex ab_?(?!p) does not behave as expected because of backtracking. It matches ab, then an optional _, and then tries the negative lookahead. When the lookahead finds p, the engine backtracks: _? gives up the underscore and matches an empty string instead, and the lookahead is tried again right before _. Since _ is not p, the lookahead succeeds and a match is returned.

    The correct PCRE regex would be ab(?!_?p), see the regex demo. After matching ab, the regex engine tries the lookahead only once at that position, and if it finds an optional _ followed by a p there, the whole match fails.
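A minimal sketch of the difference between the two PCRE patterns, assuming the five object names from the question are held in a plain character vector (the variable names `nms`, `orig_hits` and `fixed_hits` are only illustrative). `grepl()` with `perl = TRUE` is used here because it supports lookaheads:

```r
# Object names from the question
nms <- c("ab", "ab_b", "ab_pm", "ab_pn", "c1_ab_b")

# Original pattern: the optional "_" can be given back during backtracking,
# so the lookahead ends up being checked against "_" instead of "p"
orig_hits <- grepl("ab_?(?!p)", nms, perl = TRUE)

# Corrected pattern: the optional "_" sits inside the lookahead itself
fixed_hits <- grepl("ab(?!_?p)", nms, perl = TRUE)

print(setNames(orig_hits, nms))
print(setNames(fixed_hits, nms))
```

The first pattern should flag all five names, while the second should leave out ab_pm and ab_pn.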

    ls() has no perl = TRUE argument, so it only supports the default TRE regex library, which does not support lookarounds.
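One possible workaround, if the lookahead pattern is preferred, is to let ls() return all names and filter them with grep(perl = TRUE) afterwards. The sketch below assumes a scratch environment (`env`, a hypothetical stand-in for the real workspace) so that it does not touch any existing objects:

```r
# A scratch environment standing in for the workspace (illustrative setup)
env <- new.env()
for (nm in c("ab", "ab_b", "ab_pm", "ab_pn", "c1_ab_b")) {
  assign(nm, 1, envir = env)
}

# ls() returns a character vector, so the PCRE filter can be applied by grep()
to_remove <- grep("ab(?!_?p)", ls(envir = env), perl = TRUE, value = TRUE)
rm(list = to_remove, envir = env)

# Names that survive the removal
remaining <- ls(envir = env)
print(remaining)
```

In an interactive session the same idea would presumably be written without the `envir` arguments, for example `rm(list = grep("ab(?!_?p)", ls(), perl = TRUE, value = TRUE))`.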

    You may use a TRE-compatible pattern without lookarounds instead:

    ab([^_]p|_[^p]|.?$)
    

    See the regex demo. Details:

    • ab - the literal text ab
    • ([^_]p|_[^p]|.?$) - one of three alternatives:
      • [^_]p - any char but _ and then p
      • | - or
      • _[^p] - a _ and then any char but p
      • | - or
      • .?$ - any one optional char and then end of string.
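As a quick check of the lookaround-free pattern, the sketch below runs it through grepl() without perl = TRUE (the same default engine that ls(pattern = ) uses) and then through ls() itself in a scratch environment. The names `tre_pattern`, `env`, `tre_hits` and `kept` are illustrative only:

```r
nms <- c("ab", "ab_b", "ab_pm", "ab_pn", "c1_ab_b")
tre_pattern <- "ab([^_]p|_[^p]|.?$)"

# grepl() without perl = TRUE uses the default (TRE) engine
tre_hits <- nms[grepl(tre_pattern, nms)]
print(tre_hits)

# Same pattern passed directly to ls() in a scratch environment
env <- new.env()
for (nm in nms) {
  assign(nm, 1, envir = env)
}
rm(list = ls(envir = env, pattern = tre_pattern), envir = env)

# Names left after the removal
kept <- ls(envir = env)
print(kept)
```

Note that this pattern is not an exact equivalent of the lookahead version for every possible input: a hypothetical name such as abp is matched here (through the .?$ alternative) but not by ab(?!_?p). For the five names in the question, both select the same three objects.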