ruby regex mediawiki wiki-markup

Regex to match pipes not within brackets or braces with nested blocks


I am trying to parse some wiki markup. For example, the following:

{{Some infobox royalty|testing
| name = Louis
| title = Prince Napoléon 
| elevation_imperial_note= <ref name="usgs">{{cite web|url={{Gnis3|1802764}}|title=USGS}}</ref>
| a = [[AA|aa]] | b =  {{cite
|title=TITLE
|author=AUTHOR}}
}}

can be the text to start with. I first remove the starting {{ and ending }}, so I can assume those are gone.

I want to do .split(<regex>) on the string to split the string by all | characters that are not within braces or brackets. The regex needs to ignore the | characters in [[AA|aa]], <ref name="usgs">{{cite web|url={{Gnis3|1802764}}|title=USGS}}</ref>, and {{cite|title=TITLE|author=AUTHOR}}. The expected result is:

[
 'testing',
 'name = Louis', 
 'title = Prince Napoléon', 
 'elevation_imperial_note= <ref name="usgs">{{cite web|url={{Gnis3|1802764}}|title=USGS}}</ref>',
 'a = [[AA|aa]]',
 'b =  {{cite\n|title=TITLE\n|author=AUTHOR}}'
]

There can be line breaks at any point, so I can't just look for \n|. If there is extra white space in it, that is fine. I can easily strip out extra \s* or \n*.

https://regex101.com/r/dEDcAS/2


Solution

  • The following is a pure-Ruby solution. I assume the braces and brackets in the string are balanced.

    str =<<BITTER_END
    Some infobox royalty|testing
    | name = Louis
    | title = Prince Napoléon 
    | elevation_imperial_note= <ref name="usgs">{{cite web|url={{Gnis3|1802764}}|title=USGS}}</ref>
    | a = [[AA|aa]] | b =  {{cite
    |title=TITLE
    |author=AUTHOR}}
    BITTER_END
    

    stack = []
    last = 0
    # Walk the string once, recording the index of every pipe that is seen
    # while no bracket or brace is open (i.e. while the stack is empty).
    str.each_char.with_index.with_object([]) do |(c,i),locs|
      case c
      when ']', '}'
        stack.pop
      when '[', '{'
        stack << c
      when '|'
        locs << i if stack.empty?
      end
    end.map do |i|
      # Cut out the text between the previous split point and this pipe.
      # The exclusive range also copes with a pipe at index 0.
      old_last = last
      last = i+1
      str[old_last...i].strip
    end.tap { |a| a << str[last..-1].strip if last < str.size }
      #=> ["Some infobox royalty",
      #    "testing",
      #    "name = Louis", 
      #    "title = Prince Napoléon",
      #    "elevation_imperial_note= <ref name=\"usgs\">
      #      {{cite web|url={{Gnis3|1802764}}|title=USGS}}</ref>",
      #    "a = [[AA|aa]]",
      #    "b =  {{cite\n|title=TITLE\n|author=AUTHOR}}"]
    

    Note that, to improve readability, I've broken the string that is the antepenultimate element of the returned array¹.
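
Because the scan only ever asks whether the stack is empty, the same idea can be packaged as a small method that tracks nesting depth with an integer. The following is a minimal sketch rather than part of the solution above: the method name `split_top_level` and its `sep` parameter are hypothetical, the sample string is made up, and, like the solution, it assumes the braces and brackets are balanced.

```ruby
# Hypothetical helper: split on separators that sit outside every [...] or {...}.
# Assumes balanced delimiters, as the solution above does.
def split_top_level(str, sep = '|')
  depth = 0
  fields = [+'']
  str.each_char do |c|
    case c
    when '[', '{' then depth += 1
    when ']', '}' then depth -= 1
    end
    if c == sep && depth.zero?
      fields << +''         # top-level separator: start a new field
    else
      fields.last << c      # anything else belongs to the current field
    end
  end
  fields.map(&:strip)
end

sample = "name = Louis | a = [[AA|aa]] | b = {{cite\n|title=T}}"
p split_top_level(sample)
  #=> ["name = Louis", "a = [[AA|aa]]", "b = {{cite\n|title=T}}"]
```

Building the fields while scanning avoids the second pass over the recorded locations, at the cost of appending one character at a time.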

    Explanation

    For an explanation of how the locations of the pipe symbols on which to split are determined, run the Heredoc above to determine str (the Heredoc needs to be un-indented first), then run the following code. All will be revealed. (The output is long, so focus on changes to the arrays locs and stack.)

    stack = []
    str.each_char.with_index.with_object([]) do |(c,i),locs|
      puts "c=#{c}, i=#{i}, locs=#{locs}, stack=#{stack}" 
      case c
      when ']', '}'
        puts "  pop #{c} from stack"
        stack.pop
      when '[', '{'
        puts "  push #{c} onto stack"
        stack << c
      when '|'
        puts stack.empty? ? "  record location of #{c}" : "  skip | as stack is non-empty" 
        locs << i if stack.empty?
      end
      puts "  after: locs=#{locs}, stack=#{stack}"
    end
      #=> [20, 29, 44, 71, 167, 183]
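
Once the split locations are known, the pieces can also be cut without a mutable `last` variable, by pairing each location with its successor. This is only a sketch on a shorter, made-up string; the variable names are hypothetical, and `locs` is hard-coded to the two indices the scan above would record for this sample.

```ruby
sample = "a = [[X|y]] | b = {{c|d}} | e"
locs = [12, 26]   # indices of the two top-level pipes in sample

# Add a sentinel before the start and one past the end, then slice
# between each pair of consecutive boundaries.
bounds = [-1, *locs, sample.size]
fields = bounds.each_cons(2).map { |from, to| sample[(from + 1)...to].strip }
p fields
  #=> ["a = [[X|y]]", "b = {{c|d}}", "e"]
```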
    

    If desired, one can confirm the braces and brackets are balanced as follows.

    def balanced?(str)
      h = { '}'=>'{', ']'=>'[' }
      stack = []
      str.each_char do |c|
        case c
        when '[', '{'
          stack << c
        when ']', '}'
          stack.last == h[c] ? (stack.pop) : (return false)
        end
      end   
      stack.empty?
    end
    
    balanced?(str)
      #=> true
    
    balanced?("[[{]}]")
      #=> false
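
For input that passes `balanced?`, the regex originally asked for is also within reach, because Ruby's regex engine supports recursive subexpression calls (`\g<name>`). Rather than splitting on pipes, the sketch below matches the fields themselves. It is an alternative under stated assumptions, not part of the solution above: the constant name `FIELD` and the sample string are made up, empty fields are skipped, and the result is not meaningful for unbalanced input.

```ruby
# FIELD matches one run of text that contains no top-level pipe. The named
# group "nest" matches a balanced {...} or [...] block and calls itself via
# \g<nest> to handle nesting.
FIELD = /
  (?:
    (?<nest>
        \{ (?: [^{}\[\]] | \g<nest> )* \}
      | \[ (?: [^{}\[\]] | \g<nest> )* \]
    )
    | [^|{}\[\]]
  )+
/x

sample = "name = Louis | a = [[AA|aa]] | b = {{cite|url={{Gnis3|1}}|title=T}}"

# String#gsub without a replacement or block returns an enumerator over the
# full matches, which sidesteps String#scan returning only the captured group.
fields = sample.gsub(FIELD).map(&:strip)
p fields
```

Each element of `fields` is one top-level field with surrounding whitespace stripped; pipes inside `[[...]]` and `{{...}}` stay within their field.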
    

    ¹ ...and, in the interest of transparency, to have the opportunity to use a certain word.