I found this question about using capture groups with the \K match reset (I am not sure that is the correct name), but it does not answer my query.
Suppose I have the following string:
ab
With the following regex
a\Kb
the output is, as expected, b.
However, when adding a capture group (i.e., $1) using the regex
(a\Kb)
group $1 returns ab and not b.
Given the following string:
ab
cd
Using the regex
(a\Kb)|(c\Kd)
I would hope group $1 to contain b and group $2 to contain d, but that is not the case.
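For reference, \K only moves the start of the overall match ($0). A capture group still records everything matched between its parentheses, which is why $1 holds ab. As a rough illustration of one way around that, the sketch below swaps \K for a fixed-width lookbehind. It is written in Python purely for demonstration: Python's built-in re module does not support \K, and the variable names are made up for the example. Lookbehinds in PCRE must be fixed-width, so this only fits simple prefixes like the ones here.

```python
import re

text = "ab\ncd"

# A lookbehind asserts the prefix without consuming it, so the prefix
# never ends up inside the capture group (unlike a prefix followed by \K).
pattern = re.compile(r"((?<=a)b)|((?<=c)d)")

# Collect (whole match, group 1, group 2) for every match.
results = [(m.group(0), m.group(1), m.group(2)) for m in pattern.finditer(text)]
print(results)  # [('b', 'b', None), ('d', None, 'd')]
```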
I tried Wiktor Stribiżew's answer that points to using a branch reset group:
(?|a\Kb)|(?|c\Kd)
However, now both matches are part of group $0, whereas I require them to be part of groups $1 and $2, respectively. Do you have any ideas on how this can be achieved? I am using Oniguruma regular expressions and the PCRE flavor.
Update based on the comments below.
The example above was meant to be easy to understand and reproduce. @Booboo pointed out that a non-capturing group does the trick:
(?:a\K(b))|(?:c\K(d))
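A quick way to see why this works: the capture groups now wrap only the part that should be kept, so \K is no longer needed to get the right group contents. The sketch below assumes Python's built-in re module, which has no \K, so \K is simply left out of the pattern. The groups behave the same, only the overall match ($0) differs.

```python
import re

# Same structure as the pattern above, minus \K: the prefix sits in the
# non-capturing group, the wanted text sits in the inner capture group.
pattern = re.compile(r"(?:a(b))|(?:c(d))")

# findall returns one (group 1, group 2) tuple per match; a group that
# did not participate in the match shows up as an empty string.
groups = pattern.findall("ab\ncd")
print(groups)  # [('b', ''), ('', 'd')]
```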
This produces the expected output for the example above. However, when applied to another example, it fails. Therefore, for clarity, I am extending this question to cover the more complicated scenario discussed in the comments.
Suppose I have the following text in a markdown file:
- [x] Example task. | Task ends. [x] Another task.
- [x] ! Example task. | This ends. [x] ! Another task.
This is a sentence. [x] Task is here.
Other text. Another [x] ! Task is here.
| | Task name | Plan | Actual | File |
| :---- | :-------------| :---------: | :---------: | :------------: |
| [x] | Task example. | 08:00-08:45 | 08:00-09:00 | [[task-one]] |
| [x] ! | Task example. | 08:00-08:45 | 08:00-09:00 | [[task-one]] |
I am interested in a single regular expression with two capture groups as follows:
- group $1: the text that follows [x], both outside and inside the table;
- group $2: the text that follows [x] !, both outside and inside the table.
I have the following regexes (see demo here) that work when evaluated individually, but not when used inside a capture group:
$1:
[^\|\s]\s*\[x\]\s*\K[^!|\n]*
(?:\G(?!\A)\||(?<=\[x]\s)\s*\|)\K[^|\n]*(?=\|)
$2:
[^\|\s]\s*\[x\]\s*\!\s*\K[^|\n]*
(?:\G(?!\A)\||(?<=\[x]\s)\s*\!\s*\|)\K[^|\n]*(?=\|)
The problem I am experiencing is when combining the expressions above.
Pseudo regex:
([x] outside|[x] inside)|([x] ! outside|[x] ! inside)
Actual regex:
([^\|\s]\s*\[x\]\s*\K[^!|\n]*|(?:\G(?!\A)\||(?<=\[x]\s)\s*\|)\K[^|\n]*(?=\|))|([^\|\s]\s*\[x\]\s*\!\s*\K[^|\n]*|(?:\G(?!\A)\||(?<=\[x]\s)\s*\!\s*\|)\K[^|\n]*(?=\|))
The result can be seen in the demo linked above.
The regex for the matches inside the table is based on Wiktor Stribiżew's answer and explained here.
You can use
(?|(?:\G(?!\A)(?<=\|)|^\|\h*\[x\]\h*\|)\h*\K([^|\n]+)(?<=\S)\h*\||\[x]\h*\K([^|\s!]+(?:\h*[^|\s]+)*))|(?|(?:\G(?!\A)\||^\|\h*\[x]\h*!\h*\|)\h*\K([^|\n]+)(?<=\S)\h*|\[x]\h*!\h*\K([^|\s]+(?:\h*[^|\s]+)*))
See the regex demo. Details:
- (?|(?:\G(?!\A)(?<=\|)|^\|\h*\[x\]\h*\|)\h*\K([^|\n]+)(?<=\S)\h*\||\[x]\h*\K([^|\s!]+(?:\h*[^|\s]+)*)) - a branch reset group matching:
  - (?:\G(?!\A)(?<=\|)|^\|\h*\[x\]\h*\|) - a non-capturing group matching either
    - \G(?!\A)(?<=\|) - the end of the previous successful match that is immediately preceded with a | char
    - | - or
    - ^\|\h*\[x\]\h*\| - start of a line/string, |, zero or more horizontal whitespaces, [x], zero or more horizontal whitespaces, |
  - \h*\K - zero or more horizontal whitespaces that are immediately discarded from the match value after matching
  - ([^|\n]+)(?<=\S) - Group 1: one or more chars other than a LF and |, as many as possible, but the chunk should end with a non-whitespace char
  - \h*\| - zero or more horizontal whitespaces and a | char
  - | - or
  - \[x]\h*\K - [x], zero or more horizontal whitespaces, and this text is discarded from the match value
  - ([^|\s!]+(?:\h*[^|\s]+)*) - Group 1 (mind it is a branch reset group): one or more chars other than !, | and whitespace, and then zero or more occurrences of zero or more horizontal whitespaces and then one or more chars other than | and whitespace
- | - or
- (?|(?:\G(?!\A)\||^\|\h*\[x]\h*!\h*\|)\h*\K([^|\n]+)(?<=\S)\h*|\[x]\h*!\h*\K([^|\s]+(?:\h*[^|\s]+)*)) - a branch reset group:
  - (?:\G(?!\A)\||^\|\h*\[x]\h*!\h*\|) - end of the previous successful match and a | char after, or start of string, |, zero or more horizontal whitespaces, [x], ! enclosed with zero or more horizontal whitespaces, a | char
  - \h*\K - zero or more horizontal whitespaces and the whole text matched so far is discarded from the match value
  - ([^|\n]+)(?<=\S) - Group 2: any one or more chars other than LF and | chars that end with a non-whitespace char
  - \h* - zero or more horizontal whitespaces
  - | - or
  - \[x] - a [x] string
  - \h*!\h*\K - ! enclosed with zero or more horizontal whitespaces and the whole text matched so far is discarded from the match value
  - ([^|\s]+(?:\h*[^|\s]+)*) - Group 2 (mind it is a branch reset group): one or more chars other than | and whitespace, and then zero or more occurrences of zero or more horizontal whitespaces and then one or more chars other than | and whitespace.
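To make the intended contents of the two groups concrete, here is a rough Python sketch that collects the same two sets of values from the sample text. It does not run the pattern above: Python's built-in re module supports none of \K, \G, \h or branch reset groups, so the sketch swaps in a plain line-by-line approach with ordinary capture groups. The names ROW, INLINE and extract are made up for the illustration.

```python
import re

SAMPLE = """\
- [x] Example task. | Task ends. [x] Another task.
- [x] ! Example task. | This ends. [x] ! Another task.
This is a sentence. [x] Task is here.
Other text. Another [x] ! Task is here.
| | Task name | Plan | Actual | File |
| :---- | :-------------| :---------: | :---------: | :------------: |
| [x] | Task example. | 08:00-08:45 | 08:00-09:00 | [[task-one]] |
| [x] ! | Task example. | 08:00-08:45 | 08:00-09:00 | [[task-one]] |
"""

# Table row whose first cell is "[x]" or "[x] !"; group 2 holds the other cells.
ROW = re.compile(r"^\|\s*\[x\]\s*(!)?\s*\|(.*)\|\s*$")
# "[x]" or "[x] !" outside a table, followed by text up to the next "|".
INLINE = re.compile(r"\[x\][ \t]*(!)?[ \t]*([^|\n]*)")

def extract(text):
    """Return (values for $1, values for $2) in document order."""
    plain, flagged = [], []
    for line in text.splitlines():
        row = ROW.match(line)
        if row:
            cells = [cell.strip() for cell in row.group(2).split("|")]
            (flagged if row.group(1) else plain).extend(cells)
            continue
        for m in INLINE.finditer(line):
            (flagged if m.group(1) else plain).append(m.group(2).strip())
    return plain, flagged

plain, flagged = extract(SAMPLE)
print(plain)
print(flagged)
```

For the sample above both lists come out as Example task., Another task., Task is here., Task example., 08:00-08:45, 08:00-09:00, [[task-one]], which is what groups 1 and 2 of the regex are meant to capture.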