c# regex regex-group

Get individual headers from CURL with Regex Group


I'm trying to get all the headers of a bash curl command using regex group capturing, but it grabs all the headers in one group (plus a second group that I can't explain).

The bash:

curl '<url>' -H 'origin: <url>' -H 'accept-encoding: <...>' -H 'accept-language: <...>' <continues with more headers> --data '<...>'

and it goes on with other headers.

The code:

var rawBash = RawBash.Text;
var headerPattern = @"\-H[\s][\']{1}(.+)[\']{1}";
var headers = Regex.Match(rawBash, headerPattern);

I've tested the pattern here and it says "11 Captures", and 'correctly' indicates the groups I want captured, but when I debug the code it indicates that 2 groups were captured:

  1. The entire CURL starting with the first "-H"
  2. The entire CURL starting with "origin:"

What's happening? I'm guessing the Regex is taking the (.+) and not terminating when it hits the [\']{1} because ' matches (.+)... but how do I make it capture each individual header in a group?

I've tried to read through a few C# RegEx tutorials/descriptions, but I haven't been able to find what I'm looking for (or describe what I'm looking for in the correct wording).

EDIT: Literally seconds after posting I had the idea to try this pattern:

var headerPattern = @"\-H[\s][\']{1}([^\']+)[\']{1}";

Notice the group is now ([^\']+) instead of (.+). It is now working as I want it to.

Also, I'm using Regex.Match(...), and it should be Regex.Matches(...) to get all the matches.

But I guess the question sort of still stands: how can someone terminate a group capture at a point? I recall a friend using the term "forward lookup" in what I believe was a similar situation, but I have no idea how to implement it.


Solution

  • What you're seeing is the effect of greedy vs. lazy (or non-greedy) matching.

    Greedy matching will match as many characters as possible. Lazy matching will only match as many characters as required.

    In your original pattern, (.+) is a greedy match of one or more of any character, so it grabs everything from just after the first -H ' up to the last ' in the string.

    What you changed it to, ([^\']+), is also greedy, but it terminates early because it doesn't match just any character; it only matches characters that are not a '.

    You can change a * or + to lazy by adding a ? directly after.
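
To make the difference concrete, here is a minimal sketch that runs the three variants of the capture group against a shortened, made-up input with two headers. The class name, the helper `FirstGroup`, and the sample string are illustrative assumptions, not part of the original question; the backslashes before `-` and `'` are dropped because they are not needed in a C# verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Returns capture group 1 of the first match, or "" when nothing matches.
    public static string FirstGroup(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : "";
    }

    public static void Main()
    {
        // Hypothetical, shortened input with two quoted headers.
        string input = "-H 'a: 1' -H 'b: 2'";

        // Greedy: runs to the LAST quote in the string.
        Console.WriteLine(FirstGroup(input, @"-H\s'(.+)'"));     // a: 1' -H 'b: 2

        // Lazy: stops at the first quote that lets the match succeed.
        Console.WriteLine(FirstGroup(input, @"-H\s'(.+?)'"));    // a: 1

        // Greedy negated class: cannot cross a quote at all.
        Console.WriteLine(FirstGroup(input, @"-H\s'([^']+)'"));  // a: 1
    }
}
```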

    My solution for your header matcher (assuming your example string is fairly representative of a consistent format) is:

    \-H\s+\'(.+?)\'
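
As a rough sketch of how this pattern could be used from C#, the snippet below combines it with Regex.Matches (rather than Regex.Match) so that every header is returned as its own match. The class name `CurlHeaders`, the method `Extract`, and the sample command with its example.com URL are assumptions made up for illustration; the pattern is the same one written without the redundant backslashes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CurlHeaders
{
    // Collects the text between the quotes of every -H '...' argument.
    public static List<string> Extract(string rawBash)
    {
        var headers = new List<string>();

        // Regex.Matches returns all matches; Regex.Match returns only the first.
        foreach (Match m in Regex.Matches(rawBash, @"-H\s+'(.+?)'"))
        {
            // Groups[0] is the whole match, Groups[1] is the header text.
            headers.Add(m.Groups[1].Value);
        }
        return headers;
    }

    public static void Main()
    {
        // Hypothetical command, standing in for RawBash.Text.
        string rawBash = "curl 'https://example.com' -H 'origin: https://example.com' "
                       + "-H 'accept-language: en-US' --data 'x=1'";

        foreach (string header in Extract(rawBash))
        {
            Console.WriteLine(header);
        }
    }
}
```

This assumes header values never contain a single quote themselves; a value with an escaped quote would need a more careful pattern or a real shell-argument parser.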
    

    Your friend is referring to a positive lookahead. It looks ahead in the string and must succeed for the overall match to succeed, but the text it matches is not included in the full match. The syntax is (?=...). There are also a negative lookahead, (?!...), and positive and negative lookbehinds, (?<=...) and (?<!...) respectively. They should be used with caution as they can be really inefficient on longer strings.

    For example take the following 2 strings:

    regex isnt always the right answer|this will match
    
    regex isnt always the right answer|this will not
    

    Using the following pattern:

    regex (is.*) always (the right answer(?=.*this will match))
    

    will result in this for the first string:

    Full match  0-34    `regex isnt always the right answer`
    Group 1.    6-10    `isnt`
    Group 2.    18-34   `the right answer`
    

    and will not match the second at all.
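
The same experiment can be reproduced in C# with a short sketch like the one below. The class name `LookaheadDemo` and the helper `Describe` are hypothetical names chosen for this example; the pattern and the two input strings are the ones from above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // The lookahead (?=...) must succeed, but consumes no characters.
    public const string Pattern =
        @"regex (is.*) always (the right answer(?=.*this will match))";

    // Summarises the match as "full|group1|group2", or "no match".
    public static string Describe(string input)
    {
        Match m = Regex.Match(input, Pattern);
        if (!m.Success)
        {
            return "no match";
        }
        return m.Value + "|" + m.Groups[1].Value + "|" + m.Groups[2].Value;
    }

    public static void Main()
    {
        // Full match stops before the '|': the lookahead text is not included.
        Console.WriteLine(Describe("regex isnt always the right answer|this will match"));

        // The lookahead fails here, so the whole pattern fails.
        Console.WriteLine(Describe("regex isnt always the right answer|this will not"));
    }
}
```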