I'm trying to get all the headers of a bash curl command using regex capture groups, but the pattern grabs all the headers in one group (plus another group that I can't explain).
The bash:
curl '<url>' -H 'origin: <url>' -H 'accept-encoding: <...>' -H 'accept-language: <...>' <continues with more headers> --data '<...>'
and it goes on with other headers.
The code:
var rawBash = RawBash.Text;
var headerPattern = @"\-H[\s][\']{1}(.+)[\']{1}";
var headers = Regex.Match(rawBash, headerPattern);
I've tested the pattern in an online regex tester, which reports "11 Captures" and correctly highlights the groups I want captured, but when I debug the code it shows that only 2 groups were captured.
What's happening? I'm guessing the regex is taking the (.+) and not terminating when it hits the [\']{1} because ' matches (.+) ... but how do I make it capture each individual header in a group?
I've tried to read through a few C# RegEx tutorials/descriptions, but I haven't been able to find what I'm looking for (or describe what I'm looking for in the correct wording).
EDIT: Literally seconds after posting I had the idea to try this pattern:
var headerPattern = @"\-H[\s][\']{1}([^\']+)[\']{1}";
Notice the group is now ([^\']+) instead of (.+). It is now working as I want it to.
Also, I was using Regex.Match(...), and it should be Regex.Matches(...) to get all the matches.
But I guess the question still stands: how can someone terminate a group capture at a given point? I recall a friend using the term forward lookup in what I believe was a similar situation, but I have no idea how to implement it.
What you're seeing is the effect of greedy vs. lazy (or non-greedy) matching.
Greedy matching will match as many characters as possible. Lazy matching will only match as many characters as required.
In your original pattern, (.+) is a greedy match of one or more of any character, so it will grab everything from your first -H ' to the last ' in the string.
What you changed it to, ([^\']+), is also greedy, but it terminates early because it no longer matches any character; it only matches characters that are not a '.
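To make the difference concrete, here is a minimal sketch comparing the two groups. The sample string is a made-up stand-in for your real command, and the patterns are simplified equivalents of yours (the escaping of ' and - is not needed in .NET). As a side note, .NET always reports the whole match as group 0, which is most likely why the debugger shows two groups for one capture group.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input modeled on the command in the question.
var sample = "curl 'http://example.com' -H 'origin: a' -H 'accept-language: b' --data 'x=1'";

// Greedy: .+ runs to the end of the input, then backtracks to the LAST quote.
var greedyMatch = Regex.Match(sample, @"-H\s'(.+)'");
var greedy = greedyMatch.Groups[1].Value;

// Negated class: [^']+ cannot cross a quote, so it stops at the first one.
var negated = Regex.Match(sample, @"-H\s'([^']+)'").Groups[1].Value;

Console.WriteLine(greedy);   // origin: a' -H 'accept-language: b' --data 'x=1
Console.WriteLine(negated);  // origin: a

// Group 0 is the full match, group 1 is the parenthesized capture.
Console.WriteLine(greedyMatch.Groups.Count);  // 2
```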
You can change a * or + to lazy by adding a ? directly after it.
My solution to your header matcher (assuming your example string is fairly representative of a consistent format) is:
\-H\s+\'(.+?)\'
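Plugged into your code, that could look roughly like the sketch below. It uses Regex.Matches, as you noted in your edit, so every header is returned rather than only the first. The rawBash value here is an invented sample standing in for RawBash.Text.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample; in the question this comes from RawBash.Text.
var rawBash = "curl 'http://example.com' -H 'origin: a' -H 'accept-language: b' --data 'x=1'";

// Lazy quantifier: .+? expands only until the next quote can match.
var headerPattern = @"-H\s+'(.+?)'";

// Matches returns every non-overlapping match; group 1 holds the header text.
var headers = Regex.Matches(rawBash, headerPattern)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToList();

foreach (var header in headers)
{
    Console.WriteLine(header);
}
// origin: a
// accept-language: b
```

Each header ends up in group 1 of its own Match, rather than in separately numbered groups of a single match.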
Your friend is referring to a positive lookahead. It looks forward in the string and must succeed for the overall match to succeed, but what it matches is not part of the full match string. The syntax is (?=...). There is also a negative lookahead, (?!...), and positive and negative lookbehinds, (?<=...) and (?<!...) respectively. They should be used with caution, as they can be inefficient on longer strings.
For example, take the following 2 strings:
regex isnt always the right answer|this will match
regex isnt always the right answer|this will not
If I use the following pattern:
regex (is.*) always (the right answer(?=.*this will match))
the result for the first string is:
Full match 0-34 `regex isnt always the right answer`
Group 1. 6-10 `isnt`
Group 2. 18-34 `the right answer`
and it will not match the second string at all.
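The same lookahead example can be checked in C# with a short sketch like this one, using the two strings and the pattern from above unchanged:

```csharp
using System;
using System.Text.RegularExpressions;

var pattern = @"regex (is.*) always (the right answer(?=.*this will match))";

var first = Regex.Match("regex isnt always the right answer|this will match", pattern);
var second = Regex.Match("regex isnt always the right answer|this will not", pattern);

// The lookahead must succeed, but the text it inspects is not consumed,
// so the full match stops before the | character.
Console.WriteLine(first.Success);          // True
Console.WriteLine(first.Value);            // regex isnt always the right answer
Console.WriteLine(first.Groups[1].Value);  // isnt
Console.WriteLine(first.Groups[2].Value);  // the right answer

// "this will match" never appears after the candidate, so nothing matches.
Console.WriteLine(second.Success);         // False
```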