regex vim match text-extraction

How to extract all regex matches in a file using Vim?


Consider the following example:

case Foo:
    ...
    break;
case Bar:
    ...
    break;
case More: case Complex:
    ...
    break;
...

Say, we would like to retrieve all matches of the regex case \([^:]*\): (the whole matching text or, even better, the part between \( and \)), which should give us (preferably in a new buffer) something like this:

Foo
Bar
More
Complex
...

Another example of a use case would be extraction of some fragments of an HTML file, for instance, image URLs.

Is there a simple way to collect all regex matches and take them out to a separate buffer in Vim?

Note: It’s similar to the question “How to extract text matching a regex using Vim?”. However, unlike the setting in that question, I’m also interested in removing the lines that don’t match, preferably without a hugely complicated regex.


Solution

  • There is a general way of collecting pattern matches throughout a piece of text. The technique takes advantage of the substitute-with-an-expression feature of the :substitute command (see :help sub-replace-\=). The key idea is to run a substitution that visits every match of the pattern and, for each one, evaluates an expression that stores the match, without actually changing the text.

    First, let us consider saving the matches. In order to keep a sequence of matching text fragments, it is convenient to use a list (see :help List). However, it is not possible to modify a list straightforwardly, using the :let command, since there is no way to run Ex commands in expressions (including \= substitute expressions). Yet, we can call one of the functions that modify a list in place, for example, the add() function that appends a given item to a list (see :help add()).

    Another problem is how to avoid text modifications while running a substitution. One approach is to make the pattern always have a zero-width match by prepending \ze or by appending \zs atoms to it (see :help /\zs, :help /\ze). The pattern modified in this way captures an empty string preceding or succeeding an occurrence of the original pattern in text (such matches are called zero-width matches in Vim; see :help /zero-width). Then, if the replacement text is also empty, substitution effectively changes nothing: it just replaces a zero-width match with an empty string.
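
    As a small illustration of the zero-width idea, here is a sketch that uses the built-in match() and substitute() functions on a sample string instead of a buffer. The sample string is made up for the example.

    ```vim
    " Appending \zs moves the start of the match to the end of the pattern,
    " so the match is an empty string located just past the colon.
    echo match('case Foo: x', 'case \w\+:\zs')
    " → 9

    " Replacing that empty match with an empty string leaves the text as is.
    echo substitute('case Foo: x', 'case \w\+:\zs', '', 'g') ==# 'case Foo: x'
    " → 1
    ```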

    Since the add() function, like most of the list-modifying functions, returns a reference to the changed list, we need to somehow turn its result into an empty replacement for our technique to work. The simplest way is to extract a sublist of zero length from it by specifying a range of indices whose starting index is greater than the ending one. When a \= expression evaluates to a list, its items are joined with line breaks, so an empty list yields an empty replacement string.
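
    The behavior of add() and of the empty slice can be checked on its own, outside of any substitution. A minimal sketch (the item values are arbitrary):

    ```vim
    let m = []
    " add() returns the list it has just modified.
    echo add(m, 'Foo')
    " → ['Foo']

    " A slice whose start index exceeds its end index is an empty list;
    " the item is still appended before the slice is taken.
    echo add(m, 'Bar')[1:0]
    " → []

    echo m
    " → ['Foo', 'Bar']
    ```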

    Combining the aforementioned ideas, we obtain the following Ex command:

    :let m=[] | %s/\<case\s\+\(\w\+\):\zs/\=add(m,submatch(1))[1:0]/g
    

    After its execution, all matches of the first subgroup are accumulated in the list referenced by the variable m, and can be used as is or processed in some way. For instance, to paste the contents of the list one by one on separate lines in Insert mode, type

    Ctrl+R =m Enter

    To do the same in Normal mode, simply use the :put command:

    :put=m
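
    The question asks for the matches in a separate buffer. One possible way to get there, assuming the list m filled by the command above is still defined, is to open a new window and write the list into it with setline(). The deduplication step is optional and only an illustration of further processing.

    ```vim
    " Optional: sort the collected matches and drop duplicates in place.
    call uniq(sort(m))

    " Open a new, empty window and set one match per line, starting at
    " line 1. Unlike :put, this leaves no blank first line.
    new
    call setline(1, m)
    ```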
    

    Starting with version 7.4 (see :helpg Patch 7.3.627), Vim evaluates a \= expression in the replacement string of a substitution command for every match of the pattern, even when the n flag is given (which instructs it to simply count the number of matches without substituting—see :help :s_n). What the expression evaluates to does not matter in that case, because the resulting value is being discarded anyway, as no substitution takes place during counting.

    This allows us to take advantage of the side effects of an expression without worrying about leaving the contents of the buffer intact in the process, so all the trickery with zero-width matching and empty-sublist indexing can be elided:

    :let m=[] | %s/\<case\s\+\(\w\+\):/\=add(m,submatch(1))/gn
    

    Conveniently, the buffer does not even get marked as modified after running this command.
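
    For repeated use, the command can be wrapped in a function. The following is only a sketch: CollectMatches() is a hypothetical name, not something Vim provides, and it assumes Vim 7.4 or later and a pattern that contains no "/" characters.

    ```vim
    " Hypothetical helper: gather the first submatch of every occurrence of
    " a:pattern in the current buffer and list the results in a new window.
    function! CollectMatches(pattern) abort
      let l:matches = []
      " n: only count, so the buffer is not changed;
      " e: do not raise an error when the pattern is not found.
      execute '%s/' . a:pattern . '/\=add(l:matches, submatch(1))/gne'
      if !empty(l:matches)
        new
        call setline(1, l:matches)
      endif
      return l:matches
    endfunction
    ```

    With this in place, the example from the question becomes :call CollectMatches('\<case\s\+\(\w\+\):'). For the image URL use case, a rough pattern such as '<img[^>]*\ssrc="\([^"]*\)"' could be passed instead, though it will not cover every way of writing an img tag.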