ruby regex string string-matching

Regex matching in Ruby confusion


Can anyone explain this to me?

str = "org-id:         N/A\n"

puts str[/org-id:\s+(.+)\n/]
=> "org-id:         N/A\n"
str =~ /org-id:\s+(.+)\n/
puts $1
=> "N/A"

All I need is

str =~ /org-id:\s+(.+)\n/
puts $1

in one line. But str[/org-id:\s+(.+)\n/] and str.slice(/org-id:\s+(.+)\n/) return "org-id:         N/A\n", and str.scan(/org-id:\s+(.+)\n/).first returns ["N/A"] (an array). Why do all these matching methods behave differently?


Solution

  • From the fine manual:

    str[regexp] → new_str or nil
    str[regexp, fixnum] → new_str or nil

    If a Regexp is supplied, the matching portion of str is returned. If a numeric or name parameter follows the regular expression, that component of the MatchData is returned instead.

    So, if you do str[/org-id:\s+(.+)\n/] then you get the entire matching portion (AKA $&); if you want the first capture group (AKA $1), then you could say:

    puts str[/org-id:\s+(.+)\n/, 1]
    # 'N/A'
    

    If you had a second capture group in your regex and you wanted what it captured, you could say str[regex, 2] and so on. You could also use a named capture group and a symbol thusly:

    puts str[/org-id:\s+(?<want>.+)\n/, :want]
    

    So with the right pattern and arguments, String#[] is convenient for pulling a single regex-based chunk out of a string.
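
    As a quick illustration of the forms described above, here is a minimal sketch using the string from the question. The variable names and the non-matching pattern are made up for the example.

```ruby
# Sample input taken from the question.
str = "org-id:         N/A\n"

numbered = /org-id:\s+(.+)\n/          # one numbered capture group
named    = /org-id:\s+(?<want>.+)\n/   # same pattern with a named group

whole   = str[numbered]        # entire matching portion (what $& would hold)
by_idx  = str[numbered, 1]     # first capture group (what $1 would hold)
by_name = str[named, :want]    # named capture group
missing = str[/no-such-key:\s+(.+)\n/, 1]  # nil when the pattern does not match

p whole    # "org-id:         N/A\n"
p by_idx   # "N/A"
p by_name  # "N/A"
p missing  # nil
```

    Note that a failed match returns nil rather than raising, so the result may need a nil check before use.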

    If you look at the manual you should notice that String#[] and String#slice are the same thing.
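
    A small check of that equivalence, comparing String#slice (the method used in the question) with String#[] on the same input. This is only a sketch and the variable names are illustrative.

```ruby
str = "org-id:         N/A\n"
re  = /org-id:\s+(.+)\n/

# slice accepts the same arguments as [] and returns the same values.
same_whole   = str.slice(re)    == str[re]
same_capture = str.slice(re, 1) == str[re, 1]

p same_whole    # true
p same_capture  # true
```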


    If we look at String#=~, we see that:

    str =~ obj → fixnum or nil

    Match—If obj is a Regexp, use it as a pattern to match against str, and returns the position the match starts, or nil if there is no match.

    So when you say:

    str =~ /org-id:\s+(.+)\n/
    

    you get "org-id:         N/A\n" in $&, "N/A" in $1, and the operator's return value is 0, the index where the match starts; if there were another capture group in your regex, you'd see that part in $2. Because 0 is truthy in Ruby, the "nil or not nil" return value of =~ allows you to say things like:

    make_pancakes_for($1) if(str =~ /some pattern that makes (us) happy/)
    

    So =~ is convenient for combining parsing and boolean tests in one go.
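
    To make the return value and the match globals concrete, here is a minimal sketch with the question's string. The local variable names and the non-matching pattern are assumptions for illustration.

```ruby
str = "org-id:         N/A\n"

# Successful match: =~ returns the start index and sets $~, $&, $1, ...
pos   = str =~ /org-id:\s+(.+)\n/
whole = $~[0]   # same value as $&
first = $1

p pos    # 0
p whole  # "org-id:         N/A\n"
p first  # "N/A"

# Failed match: =~ returns nil and the match globals are reset to nil.
miss       = str =~ /no-such-key/
after_miss = $1

p miss        # nil
p after_miss  # nil

# The return value doubles as a boolean test (0 is truthy in Ruby).
org_id = $1 if str =~ /org-id:\s+(.+)\n/
p org_id  # "N/A"
```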


    The String#scan method:

    scan(pattern) → array
    scan(pattern) {|match, ...| block } → str

    Both forms iterate through str, matching the pattern (which may be a Regexp or a String). For each match, a result is generated and either added to the result array or passed to the block. If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.

    So scan gives you a simple list of matches, or an array of arrays if capture groups are involved. It is meant to pull apart a string into all its component pieces in one go (sort of like a more complicated version of String#split).

    If you wanted to grab all of the (.+) matches from your string you'd use scan and map:

    array_of_ids = str.scan(/org-id:\s+(.+)\n/).map(&:first)
    

    but you'd only bother with that if you knew there would be several org-ids in str. Scan will also leave $&, $1, ... set to the values for the last match in the scan; but if you're using scan you'll be looking for several matches at once so those globals won't be terribly useful.
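
    Here is a short sketch of how scan behaves with and without capture groups. The multi-line input is hypothetical (the question's string contains only one org-id) and the variable names are illustrative.

```ruby
# Hypothetical input with several org-id lines (not from the question).
text = "org-id:   ORG-1\norg-id:   N/A\norg-id:   ORG-3\n"

no_groups   = text.scan(/org-id:\s+.+\n/)    # no groups: array of whole matches
with_groups = text.scan(/org-id:\s+(.+)\n/)  # groups: array of arrays, one entry per group
ids         = with_groups.map(&:first)       # unwrap the single capture from each result

p no_groups    # ["org-id:   ORG-1\n", "org-id:   N/A\n", "org-id:   ORG-3\n"]
p with_groups  # [["ORG-1"], ["N/A"], ["ORG-3"]]
p ids          # ["ORG-1", "N/A", "ORG-3"]
```

    This is also why str.scan(...).first in the question returns ["N/A"]: each result is itself an array when the pattern has a capture group.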


    The three regex approaches ([], =~, and scan) offer similar functionality but they fill different niches. You could do it all with scan but that would be pointlessly cumbersome unless you were an orthogonality bigot and then you certainly wouldn't be working in Ruby except under extreme duress so it wouldn't matter.