regex perl split delimiter quotes

Regexp that splits a string but ignores a quoted delimiter


I'm writing a Perl program that needs to parse a table written in a Wiki markup language. The table syntax uses the pipe character '|' to separate the columns.

| row 1 cell 1    |row 1 cell 2  | row 1 cell 3|
| row 2 cell 1    | row 2 cell 2 |row 2 cell 3|

A cell may contain zero or more hyperlinks, whose syntax is illustrated by:

[[wiki:path:to:page|Page Title]]   or
[[wiki:path:to:page]]

Note that the hyperlink may contain the pipe character. Here, however, it is "quoted" by the [[..]] brackets.

The hyperlink syntax may not be nested.

In order to match and capture the first cell in each of these table rows,

| Potatoes [[path:to:potatoes]]           | Daisies           |
| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|

I tried:

qr{\|                      # match literal pipe
    (.*?                   # non-greedy zero or more chars
        (?:\[\[.*?\]\])    # a hyperlink 
     .*?)                  # non-greedy zero or more chars
   \|}x                    # match terminating pipe

It worked, and $1 contained the cell contents.

Then, to match

| Potatoes            | Daisies           |

I tried making the hyperlink optional:

qr{\|                      # match literal pipe
    (.*?                   # non-greedy zero or more chars
        (?:\[\[.*?\]\])?   # <-- OPTIONAL hyperlink 
     .*?)                  # non-greedy zero or more chars
   \|}x                    # match terminating pipe

This worked, but when parsing

| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|

I only got

 Kiki fruit [[path:to:kiwi

So evidently, given the option, it decided to disregard the hyperlink pattern and treat the embedded pipe as a column delimiter.
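
The behaviour can be reproduced in a few lines. This is only a minimal sketch of the pattern above with the optional hyperlink group; the variable names are illustrative. Because the quantifiers are non-greedy and the hyperlink group is allowed to match nothing, the engine is satisfied by the first pipe it reaches, even though that pipe sits inside [[..]].

```perl
use strict;
use warnings;

my $row = '| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|';

# Same pattern as above, with the hyperlink group made optional.
my $optional = qr{\|
    (.*?
        (?:\[\[.*?\]\])?
     .*?)
   \|}x;

# The optional group matches the empty string at the first position tried,
# then the trailing non-greedy .*? grows one character at a time until the
# next pipe, which is the one inside the hyperlink.
my ($cell) = $row =~ $optional;
print "[$cell]\n";
```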

Here I'm stuck. And I still haven't dealt with either the possibility of the hyperlink occurring more than once in a cell, or with giving back the trailing pipe to be the leading pipe on the next iteration.

It isn't necessary that the regexp be used within Perl's split function -- I can write the splitting loop myself if it's easier. I see many similar questions being asked, but none seem to deal closely enough with this problem.


Solution

  • $ perl -MRegexp::Common -E '$_=shift; while (
      /\| # beginning pipe, and consume it
      (   # capture 1
        (?:  # inside the pipe we will do one of these:
          $RE{balanced}{-begin=>"[["}{-end=>"]]"} # something with balanced [[..]]
          |[^|] # or a character that is not a pipe
        )* # as many of those as necessary
      ) # end capture one
      (?=\|) # needs to go to the next pipe, but do not consume it so g works
      /xg
    ) { say $1 }' '| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|'
     Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  
                 Lemons
    

    This seems to extract the cells you're looking for. However, I suspect you'd be better off with a proper parser for this language. I'd be surprised if there weren't something on CPAN, but even if there isn't, writing a parser may still be the better choice, especially as your tables start to contain stranger constructs that you need to handle.
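
Since the question states that hyperlinks cannot be nested, the same idea can probably be expressed without Regexp::Common. The following is a sketch under that assumption, not a tested replacement for a real parser: split_row is a hypothetical helper name, and the possessive quantifier (*+) needs Perl 5.10 or later. It is there so that a malformed row cannot backtrack into the middle of a hyperlink.

```perl
use strict;
use warnings;

# Hypothetical helper: split one table row into its cells, treating any
# pipe inside a [[..]] hyperlink as ordinary text. Assumes no nesting.
sub split_row {
    my ($row) = @_;
    my @cells;
    while ($row =~ m{
            \|                      # leading pipe, consumed
            (                       # capture 1: the cell contents
                (?:
                    \[\[ .*? \]\]   # a whole hyperlink, pipes included
                  | [^|]            # or any single character except a pipe
                )*+                 # possessive: never give characters back
            )
            (?=\|)                  # closing pipe, left for the next match
        }xg) {
        push @cells, $1;
    }
    return @cells;
}

my @cells = split_row('| Kiki fruit [[path:to:kiwi|Kiwi Fruit]]  |             Lemons|');
print "[$_]\n" for @cells;
```

The cells come back with their surrounding whitespace intact, so a caller would still need to trim them. A cell may contain any number of hyperlinks, because the alternation is repeated.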