How to use capture groups with the `\K` reset match?


I found this question about using capture groups with the \K match reset (I am not sure that is the correct name), but it does not answer my query.

Suppose I have the following string:

ab

With the regex a\Kb, the match is b, as expected.

However, when I add a capture group using the regex (a\Kb), group $1 contains ab and not b.

Given the following string:

ab
cd

Using the regex (a\Kb)|(c\Kd), I would hope group $1 to contain b and group $2 to contain d, but that is not the case: the groups contain ab and cd instead.

I tried Wiktor Stribiżew's answer that points to using a branch reset group:

(?|a\Kb)|(?|c\Kd)

With this pattern, however, both matches are part of group $0, whereas I need them in groups $1 and $2, respectively. Do you have any ideas on how this can be achieved? I am using Oniguruma regular expressions and the PCRE flavor.


Update based on the comments below.

The example above was meant to be easy to understand and reproduce. @Booboo pointed out that nesting the capture group inside a non-capturing group, after the \K, does the trick:

(?:a\K(b))|(?:c\K(d))

This produces the desired output, with b in group $1 and d in group $2.

However, when applied to another example it fails. Therefore, for clarity, I am extending this question to cover the more complicated scenario discussed in the comments.
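As a minimal sketch of why the nested groups work on the simple ab/cd example: a capture group records only the text between its own parentheses, so placing the group around b (or d) alone is enough. Python's built-in re module does not support \K, so the sketch below drops it and relies on the inner groups only; `group_values` is a hypothetical helper name used for illustration.

```python
import re

# Python's built-in re module has no \K, so this sketch omits it.
# Each capture group wraps only the text that should end up in $1 or $2.
PATTERN = re.compile(r"a(b)|c(d)")


def group_values(text):
    """Return (group 1, group 2) for every match found in text."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]


print(group_values("ab\ncd"))  # [('b', None), (None, 'd')]
```

Only one of the two groups participates in each match, so the other is reported as None.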

Suppose I have the following text in a markdown file:

- [x] Example task. | Task ends. [x] Another task.
- [x] ! Example task. | This ends. [x] ! Another task.

This is a sentence. [x] Task is here.
Other text. Another [x] ! Task is here.

|       | Task name     |    Plan     |   Actual    |      File      |
| :---- | :-------------| :---------: | :---------: | :------------: |
| [x]   | Task example. | 08:00-08:45 | 08:00-09:00 |  [[task-one]]  |
| [x] ! | Task example. | 08:00-08:45 | 08:00-09:00 |  [[task-one]]  |

I am interested in a single regular expression with two capture groups as follows:

  • group $1 (see the selection below):

    • outside the table: capture everything after [x] (when not followed by !) until a |

    • inside the table: capture everything after [x] (when not followed by !), excluding the | symbols

      [Image: matches for the first capture group]

  • group $2 (see the selection below):

    • outside the table: capture everything after [x] ! until a |

    • inside the table: capture everything after [x] ! excluding the | symbols

      [Image: matches for the second capture group]

I have the following regexes (see the demo) that work when evaluated individually, but not when combined inside capture groups:

  • group $1:
    • outside the table: [^\|\s]\s*\[x\]\s*\K[^!|\n]*
    • inside the table: (?:\G(?!\A)\||(?<=\[x]\s)\s*\|)\K[^|\n]*(?=\|)
  • group $2:
    • outside the table: [^\|\s]\s*\[x\]\s*\!\s*\K[^|\n]*
    • inside the table: (?:\G(?!\A)\||(?<=\[x]\s)\s*\!\s*\|)\K[^|\n]*(?=\|)

The problem I am experiencing is when combining the expressions above.

Pseudo regex:

([x] outside|[x] inside)|([x] ! outside|[x] ! inside)

Actual regex:

([^\|\s]\s*\[x\]\s*\K[^!|\n]*|(?:\G(?!\A)\||(?<=\[x]\s)\s*\|)\K[^|\n]*(?=\|))|([^\|\s]\s*\[x\]\s*\!\s*\K[^|\n]*|(?:\G(?!\A)\||(?<=\[x]\s)\s*\!\s*\|)\K[^|\n]*(?=\|))

The combined pattern does not produce the expected groups (see the demo linked above).

The regex for the matches inside the table is based on Wiktor Stribiżew's answer and explained here.


Solution

  • You can use

    (?|(?:\G(?!\A)(?<=\|)|^\|\h*\[x\]\h*\|)\h*\K([^|\n]+)(?<=\S)\h*\||\[x]\h*\K([^|\s!]+(?:\h*[^|\s]+)*))|(?|(?:\G(?!\A)\||^\|\h*\[x]\h*!\h*\|)\h*\K([^|\n]+)(?<=\S)\h*|\[x]\h*!\h*\K([^|\s]+(?:\h*[^|\s]+)*))
    

    See the regex demo. Details:

    • (?|(?:\G(?!\A)(?<=\|)|^\|\h*\[x\]\h*\|)\h*\K([^|\n]+)(?<=\S)\h*\||\[x]\h*\K([^|\s!]+(?:\h*[^|\s]+)*)) - a branch reset group matching:

      • (?:\G(?!\A)(?<=\|)|^\|\h*\[x\]\h*\|) - a non-capturing group matching either
        • \G(?!\A)(?<=\|) - the end of the previous successful match that is immediately preceded with a | char
        • | - or
        • ^\|\h*\[x\]\h*\| - start of a line/string, |, zero or more horizontal whitespaces, [x], zero or more horizontal whitespaces, |
      • \h*\K - zero or more horizontal whitespaces; the whole text matched so far is then discarded from the match value
      • ([^|\n]+)(?<=\S) - Group 1: one or more chars other than a LF and |, as many as possible, but the chunk must end with a non-whitespace char
      • \h*\| - zero or more horizontal whitespaces and a | char
    • | - or

      • \[x]\h*\K - [x], zero or more horizontal whitespaces, and this text is discarded from the match value
      • ([^|\s!]+(?:\h*[^|\s]+)*) - Group 1 (mind it is a branch reset group): one or more chars other than !, | and whitespace, and then zero or more occurrences of zero or more horizontal whitespaces and then one or more chars other than | and whitespace
    • | - or

    • (?|(?:\G(?!\A)\||^\|\h*\[x]\h*!\h*\|)\h*\K([^|\n]+)(?<=\S)\h*|\[x]\h*!\h*\K([^|\s]+(?:\h*[^|\s]+)*)) - a branch reset group:

      • (?:\G(?!\A)\||^\|\h*\[x]\h*!\h*\|) - the end of the previous successful match followed by a | char, or start of a line/string, |, zero or more horizontal whitespaces, [x], ! enclosed with zero or more horizontal whitespaces, and a | char
      • \h*\K - zero or more horizontal whitespaces and the whole text matched so far is discarded from the match value
      • ([^|\n]+)(?<=\S) - Group 2: any one or more chars other than LF and | chars that end with a non-whitespace char
      • \h* - zero or more horizontal whitespaces
    • | - or

      • \[x] - a [x] string
      • \h*!\h*\K - ! enclosed with zero or more horizontal whitespaces and the whole text matched so far is discarded from the match value
      • ([^|\s]+(?:\h*[^|\s]+)*) - Group 2 (mind it is a branch reset group): one or more chars other than | and whitespace, and then zero or more occurrences of zero or more horizontal whitespaces and then one or more chars other than | and whitespace.
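For engines without \K, \G, or \h (Python's built-in re module supports none of them), the two "outside the table" branches can be approximated with plain capture groups. The following is only a sketch under that assumption, not the pattern above: `[ \t]*` stands in for `\h*`, the table branches that depend on `\G` are left out, and `TASK` and `split_tasks` are hypothetical names.

```python
import re

# Sketch of the two "outside the table" branches only.
# [ \t]* replaces PCRE's \h*; no \K is needed because the wanted text
# is wrapped directly in capture groups 1 and 2.
TASK = re.compile(
    r"\[x\][ \t]*"
    r"(?:([^|\s!]+(?:[ \t]*[^|\s]+)*)"       # group 1: text after "[x]" without "!"
    r"|![ \t]*([^|\s]+(?:[ \t]*[^|\s]+)*))"  # group 2: text after "[x] !"
)


def split_tasks(text):
    """Collect group 1 and group 2 captures into two separate lists."""
    plain, flagged = [], []
    for m in TASK.finditer(text):
        if m.group(1) is not None:
            plain.append(m.group(1))
        else:
            flagged.append(m.group(2))
    return plain, flagged


SAMPLE = (
    "- [x] Example task. | Task ends. [x] Another task.\n"
    "- [x] ! Example task. | This ends. [x] ! Another task.\n"
    "\n"
    "This is a sentence. [x] Task is here.\n"
    "Other text. Another [x] ! Task is here.\n"
)

plain, flagged = split_tasks(SAMPLE)
print(plain)    # ['Example task.', 'Another task.', 'Task is here.']
print(flagged)  # ['Example task.', 'Another task.', 'Task is here.']
```

Table rows such as `| [x]   | Task example. |` are not matched by this sketch, because the text after [x] starts with a | char; handling them still requires the \G-based branches of the full pattern.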