regex vim regex-lookarounds

Why does vim consume the pattern after \ze in this case?


FYI, this question originated from this sed answer.

Given a 5-column CSV line with all 5 columns empty, i.e. a line that contains only ,,,, (four commas), I thought the following Vim ex command would insert hello in all 5 positions:

:s/\v(^|,)\ze(,|$)/\1hello/g

However, it does not; the output is

hello,,hello,hello,hello

The first hello is inserted because ^\ze, matches at the beginning of the line. However, it seems that the first , is then consumed by the command, so no hello is inserted between the first and second commas. Is this the case? If so, why?


Solution

  • I'm not sure of the answer, but I can share a hunch. I think this comes down to how entirely zero-width matches (e.g. /^\ze,) are handled: after such a match, the search position has to advance by one character even though nothing was technically consumed. Otherwise the next search would find the same empty match at the same position and never move on.

    Your example seems to be evidence of that. A more illustrative example is the following, which changes the input and the replacement to show exactly what each match captured.

    Given the following command:

    :s/\v(^|.)\ze(.|$)/<0\11\22>/g
    

    Running it against an input line of abcd will output:

    <01a2>a<0b1c2><0c1d2><0d12>
    

    Note how the a is both captured by the first match (as \2 after \ze, giving <01a2>) and left unmatched right after it, as shown by the a in <01a2>a<0b1c2>. Because the search resumes after that a, the ab pair is never matched/replaced.

    The only thing I can think of that would explain this is some match cursor or match index having to move past the first character, a, after the first zero-width match of /^\ze.

    In other words:

    Input: abcd
    Command: s/\v(^|.)\ze(.|$)/<0\11\22>/g
    ======================================
    
    Match/Replace 1:
    abcd => <01a2>abcd
    ^              ^
    
    Matches /^\ze.
    Will move cursor by 1 after the zero-width /^\ze. match (or else it would be stuck there)
    
    ----------------
    
    Match/Replace 2:
    <01a2>abcd  =>  <01a2>a<0b1c2>cd
           ^                      ^
    
    Matches /.\ze.
    Consumes the '.' (in this case 'b').  Not entirely zero-width.
    
    ... and so on ...
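
The walkthrough above can be checked with a small toy model. This is only a sketch of the hunch, not Vim's actual implementation: the function name vim_like_sub is made up for illustration, Vim's \ze is approximated with a Python lookahead, and the "copy one character and resume after it" rule for empty matches is the assumption being tested.

```python
import re


def vim_like_sub(pattern, repl, line):
    # Toy model (assumption, not Vim's code): a global substitute that,
    # after an EMPTY match, copies the next character unchanged and
    # resumes searching one position later.
    rx = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(line):
        # search(line, pos) keeps '^' anchored to the real start of the line
        m = rx.search(line, pos)
        if m is None:
            break
        out.append(line[pos:m.start()])  # unmatched text before the match
        out.append(repl(m))              # the replacement itself
        if m.start() == m.end():
            # Zero-width match: step over one character so the same
            # empty match is not found again at this position.
            out.append(line[m.end():m.end() + 1])
            pos = m.end() + 1
        else:
            pos = m.end()
    out.append(line[pos:])               # whatever is left of the line
    return "".join(out)


def hello(m):
    # Models the replacement \1hello
    return m.group(1) + "hello"


def tagged(m):
    # Models the replacement <0\11\22>
    return "<0" + m.group(1) + "1" + m.group(2) + "2>"


# Models :s/\v(^|,)\ze(,|$)/\1hello/g on the line ,,,,
print(vim_like_sub(r"(^|,)(?=(,|$))", hello, ",,,,"))
# Models :s/\v(^|.)\ze(.|$)/<0\11\22>/g on the line abcd
print(vim_like_sub(r"(^|.)(?=(.|$))", tagged, "abcd"))
```

Under that one assumption the model prints hello,,hello,hello,hello and <01a2>a<0b1c2><0c1d2><0d12>, the same two results reported above, which is consistent with the hunch but does not prove that Vim works this way internally.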