
Searching for unescaped ampersands in a pseudo-XML file in Neovim with a negative lookahead


I don't know how to write the equivalent of this negative lookahead search in Neovim:

&(?!(?:apos|quot|[gl]t|amp);|#)

It works when I try it with the Silver Searcher (ag), but I want to search within a single file only, using /.


Solution

  • Your question is interesting: I often have trouble with the regular expression syntax of Vim and other tools that don't use PCRE or another common syntax.

    Googling a bit, I found this article about lookarounds in Vim.

    As you can see, it's just a matter of syntax, again!

    A negative lookahead such as (?!amp;) should be written \(amp;\)\@!.

    This leads to something like this:

    &\(\(\w\{2,6\}\|#\d\{1,6\}\|#[xX][0-9a-fA-F]\{1,6\}\);\)\@!
    

    I match named entities such as &amp; and &apos; with &\w{2,6}; in PCRE, which becomes &\w\{2,6\}; in Vim's syntax.

    Tested that on this XML:

    <note>
      <author>Jani</author>
      <heading>Reminder</heading>
      <summary>Glasses &amp; sunscreen</summary>
      <body>
        Don't forget to pack your glasses & sunscreen to go to
        the beach tomorrow!
        If you forget your glasses &#x1F60E; -&gt; you'll damage your retina.
        If you don't put some sunscreen on -&gt; you'll probably get sun burnt &#127774;.
        Now, the problem is to match the ampersand in "H&M" &#128556;!
        &lt; = &#60;
        &gt; = &#62;
        &amp; = &
        &nbsp; = non-breaking space
        &iexcl; = ¡
        &cent; = ¢
        &pound; = £
        &curren; = ¤
        &yen; = ¥
        &brvbar; = ¦
        &sect; = §
        &uml; = ¨
        &copy; = ©
        &ordf; = ª
        &laquo; = «
        &not; = ¬
        &shy; = ­
        &reg; = ®
        &macr; = ¯
        &deg; = °
        &plusmn; = ±
        &sup2; = ²
        &sup3; = ³
        &acute; = ´
        &micro; = µ
        &para; = ¶
        &cedil; = ¸
        &sup1; = ¹
        &ordm; = º
        &raquo; = »
        &frac14; = ¼
        &frac12; = ½
        &frac34; = ¾
        &iquest; = ¿
        &times; = ×
        &divide; = ÷
      </body>
    </note>
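
    To sanity-check the pattern outside Vim, here is a minimal Python sketch that uses the PCRE-style spelling of the same negative lookahead. Python's re module is not PCRE, but it accepts this lookahead syntax. The name unescaped_amp_columns and the sample lines (taken from the XML above) are only illustrative, and re.ASCII is my assumption to keep \w limited to ASCII word characters.

```python
import re

# PCRE-style spelling of the Vim pattern
#   &\(\(\w\{2,6\}\|#\d\{1,6\}\|#[xX][0-9a-fA-F]\{1,6\}\);\)\@!
# re.ASCII keeps \w to [a-zA-Z0-9_] (an assumption, not part of the original).
UNESCAPED_AMP = re.compile(
    r"&(?!(?:\w{2,6}|#\d{1,6}|#[xX][0-9a-fA-F]{1,6});)",
    re.ASCII,
)


def unescaped_amp_columns(line: str) -> list[int]:
    """Return the 0-based index of every '&' that does not start an entity."""
    return [m.start() for m in UNESCAPED_AMP.finditer(line)]


samples = [
    "Glasses &amp; sunscreen",
    "glasses & sunscreen",
    'match the ampersand in "H&M" &#128556;!',
    "If you forget your glasses &#x1F60E; -&gt; you'll damage your retina.",
    "&amp; = &",
]

for line in samples:
    print(unescaped_amp_columns(line), line)
```

    Only the bare & in "glasses & sunscreen", in "H&M" and at the end of "&amp; = &" should be reported; named, decimal and hexadecimal entities are skipped.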
    

    In Vim, you have to escape parentheses, braces and pipes, but not square brackets! This is clearly not very readable. Perhaps there are some extensions to make it easier to use. After Googling a bit more, I found Perl compatible regular expressions in Vim.

    I've started writing myself a note about the flavours of regular expression engines. It might be useful for others:

    PCRE                    sed and Vim                    Description
    .                       .                              match any char
    *                       *                              0 or more times
    +                       \+                             1 or more times
    ?                       \?                             0 or 1 time
    ^                       ^                              start of line
    $                       $                              end of line
    {3}                     \{3\}                          3 times
    {3,}                    \{3,\}                         3 or more times
    (regexp)                \(regexp\)                     group matching "regexp"
    [abc]                   [abc]                          "a", "b", or "c"
    [^abc]                  [^abc]                         any char not "a", "b" or "c"
    \2                      \2                             back reference to group 2
    (?: )                   \%( \)                         non-capturing group (Vim only, not sed)
    (?=this-after)          \(this-after\)\@=              positive lookahead (Vim only, not sed)
    (?!not-this-after)      \(not-this-after\)\@!          negative lookahead (Vim only, not sed)
    (?<=this-before)        \(this-before\)\@<=            positive lookbehind (Vim only, not sed)
    (?<!not-this-before)    \(not-this-before\)\@<!        negative lookbehind (Vim only, not sed)

    It seems that sed doesn't handle lookarounds, but the syntax is very similar to Vim for most of the other cases.

    Thanks to Friedrich's comment, I learned that Vim has helpful atoms to mark the start and end of a match: \zs and \ze.

    You can place \zs anywhere in the pattern: the text before it still has to match, but the reported match only starts at that position. \ze does the same for the end of the match. You can use both to say "find this specific pattern and only replace a part of it". Example with this text:

    James Bond and James Cameron are well known, but not James Tartempion.
    

    If you only want to uppercase "James" followed by " Bond" or " Cameron":

    :s/James\ze \(Bond\|Cameron\)/JAMES/gi
    

    But if you need negative lookarounds, the pattern can get more complicated to write this way, as you'll probably have to use negated character classes. In this case, I would use the negative lookarounds to keep the pattern readable. Typically, to uppercase all "James" which aren't followed by " Tartempion":

    :s/James\( Tartempion\)\@!/JAMES/gi
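
    For comparison, here is a small Python sketch of the four lookaround kinds from the table, run on the same "James" sentence. Python's re module uses the PCRE-style spelling; the comments give the generic Vim form from the table. The variable names are only illustrative.

```python
import re

text = "James Bond and James Cameron are well known, but not James Tartempion."

# Positive lookahead, Vim form: \(...\)\@=  (or \ze before the trailing part)
followed = re.sub(r"James(?= (?:Bond|Cameron))", "JAMES", text)

# Negative lookahead, Vim form: \(...\)\@!
not_followed = re.sub(r"James(?! Tartempion)", "JAMES", text)

# Positive lookbehind, Vim form: \(...\)\@<=
# Every word that comes right after "James ".
surnames = re.findall(r"(?<=James )\w+", text)

# Negative lookbehind, Vim form: \(...\)\@<!
# Capitalised words that do NOT come right after "James ".
not_preceded = re.findall(r"(?<!James )\b[A-Z]\w+", text)

print(followed)
print(not_followed)
print(surnames)
print(not_preceded)
```

    On this sentence both substitutions give the same result, because the only "James" left alone is the one followed by " Tartempion".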
    

    Using Perl inside Vim

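    If Vim is built with Perl support (which was the case for me out of the box in Cygwin and Ubuntu), then you can simply use Perl regular expressions, the syntax PCRE is modelled on, in Vim. As far as I know, Neovim needs its Perl provider set up for :perldo to work. Typically, for your problem of ampersands that need to be converted to HTML entities: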

    :perldo s/&(?!(?:#\d{2,6}|#x[0-9a-fA-F]{2,6}|\w{2,6});)/&amp;/g
    

    And for the "James" not followed by " Tartempion" example:

    :perldo s/James(?! +Tartempion)/JAMES/gi
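
    If Perl support is not available, the same replacement can be sketched in Python and run over the file from outside the editor. This is only an illustration: escape_bare_ampersands is a made-up name, the pattern is copied from the :perldo command above, and re.ASCII is my assumption to keep \w limited to ASCII.

```python
import re

# Same pattern as in the :perldo command above.
BARE_AMP = re.compile(
    r"&(?!(?:#\d{2,6}|#x[0-9a-fA-F]{2,6}|\w{2,6});)",
    re.ASCII,  # assumption: restrict \w to ASCII word characters
)


def escape_bare_ampersands(line: str) -> str:
    """Replace every '&' that does not start an entity with '&amp;'."""
    return BARE_AMP.sub("&amp;", line)


line = 'Tom & Jerry &amp; "H&M" &#60; &lt;'
once = escape_bare_ampersands(line)
twice = escape_bare_ampersands(once)

print(once)
print(once == twice)  # a second pass leaves the line unchanged
```

    Because already escaped entities are skipped by the lookahead, running the replacement a second time leaves the text unchanged.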