I'm writing a Perl program that needs to parse a table written in a Wiki markup language. The table syntax uses the pipe character '|' to separate the columns.
| row 1 cell 1 |row 1 cell 2 | row 1 cell 3|
| row 2 cell 1 | row 2 cell 2 |row 2 cell 3|
A cell may contain zero or more hyperlinks, whose syntax is illustrated by:
[[wiki:path:to:page|Page Title]] or
[[wiki:path:to:page]]
Note that the hyperlink may contain the pipe character. Here, however, it is "quoted" by the [[..]] brackets.
The hyperlink syntax may not be nested.
In order to match and capture the first cell in each of these table rows,
| Potatoes [[path:to:potatoes]] | Daisies |
| Kiki fruit [[path:to:kiwi|Kiwi Fruit]] | Lemons|
I tried:
qr{\| # match literal pipe
(.*? # non-greedy zero or more chars
(?:\[\[.*?\]\]) # a hyperlink
.*?) # non-greedy zero or more chars
\|}x # match terminating pipe
It worked, and $1 contained the cell contents.
Then, to match
| Potatoes | Daisies |
I tried making the hyperlink optional:
qr{\| # match literal pipe
(.*? # non-greedy zero or more chars
(?:\[\[.*?\]\])? # <-- OPTIONAL hyperlink
.*?) # non-greedy zero or more chars
\|}x # match terminating pipe
This worked, but when parsing
| Kiki fruit [[path:to:kiwi|Kiwi Fruit]] | Lemons|
I only got
Kiki fruit [[path:to:kiwi
So evidently, once the hyperlink was optional, the regex engine skipped the hyperlink pattern and treated the embedded pipe as a column delimiter.
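The behaviour can be reproduced with a small script. This is only a sketch of what seems to be happening; the variable name `$cell` and the loop are illustrative, not part of the original program. Because both `.*?` parts are lazy and the hyperlink group is optional, the engine can reach the first pipe without ever needing the group, so it stops there.

```perl
use strict;
use warnings;

# The pattern with the optional hyperlink, as written above.
my $cell = qr{\|                  # match literal pipe
              (.*?                # non-greedy zero or more chars
              (?:\[\[.*?\]\])?    # optional hyperlink
              .*?)                # non-greedy zero or more chars
              \|}x;               # match terminating pipe

for my $row ('| Potatoes | Daisies |',
             '| Kiki fruit [[path:to:kiwi|Kiwi Fruit]] | Lemons|') {
    # Print the first captured cell in brackets so whitespace is visible.
    print "[$1]\n" if $row =~ $cell;
}
```

For the second row the capture ends at the pipe inside the hyperlink, which matches the truncated result shown above.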
Here I'm stuck. And I still haven't dealt with either the possibility of the hyperlink occurring more than once in a cell, or with giving back the trailing pipe to be the leading pipe on the next iteration.
It isn't necessary that the regexp be used within Perl's split function -- I can write the splitting loop myself if it's easier. I see many similar questions being asked, but none seem to deal closely enough with this problem.
$ perl -MRegexp::Common -E '$_=shift; while (
/\| # beginning pipe, and consume it
( # capture 1
(?: # inside the pipe we will do one of these:
$RE{balanced}{-begin=>"[["}{-end=>"]]"} # something with balanced [[..]]
|[^|] # or a character that is not a pipe
)* # as many of those as necessary
) # end capture one
(?=\|) # needs to go to the next pipe, but do not consume it so g works
/xg
) { say $1 }' '| Kiki fruit [[path:to:kiwi|Kiwi Fruit]] | Lemons|'
Kiki fruit [[path:to:kiwi|Kiwi Fruit]]
Lemons
This seems to extract the cells you're looking for. However, I suspect you're better off with a proper parser for this language. I'd be surprised if there wasn't something on CPAN, but even if not, writing a parser for this may still be better, especially as you start to get more weird things in your tables that you need to handle.
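Since the question says the hyperlink syntax cannot be nested, the same idea can probably be written in core Perl without Regexp::Common. The following is a minimal sketch under that assumption; `split_row` is a hypothetical helper name, and the trimming of surrounding whitespace is my own addition. The key point is that the alternation tries a whole `[[...]]` link first and only then falls back to a single non-pipe character, so a pipe inside a link is never seen as a delimiter.

```perl
use strict;
use warnings;

# Split one table row into its cells. Assumes links never nest
# (as stated in the question), so no balanced matching is needed.
sub split_row {
    my ($row) = @_;
    my @cells;
    while ($row =~ /\|                    # leading pipe, consumed
                    (                     # capture 1: the cell contents
                      (?: \[\[ .*? \]\]   # a whole link, pipes included
                        | [^|]            # or any single non-pipe character
                      )*
                    )
                    (?=\|)                # next pipe must follow; not consumed
                   /xg) {
        push @cells, $1;
    }
    return @cells;
}

for my $row ('| Potatoes [[path:to:potatoes]] | Daisies |',
             '| Kiki fruit [[path:to:kiwi|Kiwi Fruit]] | Lemons|') {
    # Trim surrounding whitespace from each cell before printing.
    my @cells = map { my $c = $_; $c =~ s/^\s+|\s+$//g; $c } split_row($row);
    print join(' / ', @cells), "\n";
}
```

The lookahead leaves the closing pipe in place, so it serves as the opening pipe of the next cell on the following iteration. An unterminated `[[` is not handled specially here, which is one of the cases where a real parser would likely behave better.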