Regexp::Grammars handling \n


I'm running the example from slide 15:

qr{
  <data>
  <rule: data>    <[text]>+
  <rule: text>    .+
}xm;

When running against a multi-line text:

line_1
line_2

I get:

'text' => [ 'line_1',
            '
            line_2' ]

So far I have not managed to get rid of the '\n' captured at the front of the second line.

I'm running Regexp::Grammars 1.048 on Strawberry Perl 5.26.1.

Update / clarification: After I (prematurely, sorry!) raised a bug against the module, Damian clarified as follows (reply slightly adapted to match the example above):

A rule with whitespace within it matches any whitespace (including newlines) in the input at that point. So a rule like:

<rule: text>    .+

is really equivalent to:

<rule: text><.ws>.+

meaning: match-but-don't-capture any leading whitespace, then match any-characters-except-newline.

If you want whitespace inside the rule to be ignored (as you seem to want here), then you need to declare the rule as a token instead. Tokens don't have the magical "whitespace-matches-whitespace" behaviour of rules. Hence you would write:

<token: line> .+

in which case you will also need to explicitly consume the newlines separating each line, with something like:

<rule: data> <[line]>+ % \n
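
For reference, here is a minimal sketch of how that advice could be assembled into a complete script. It is not part of Damian's reply. It assumes Regexp::Grammars is installed from CPAN, the names $parser, $input and @lines are only illustrative, and data is declared as a token too (a deviation from the reply) so that no implicit whitespace matching is involved anywhere:

```perl
use strict;
use warnings;
use Regexp::Grammars;   # CPAN module; assumed to be installed

# Both productions are tokens, so whitespace inside the grammar is only
# layout (because of /x) and never matches whitespace in the input.
my $parser = qr{
    <data>
    <token: data>  <[line]>+ % \n
    <token: line>  .+
}xm;

my $input = "line_1\nline_2";   # illustrative input, as in the question

my @lines;
if ($input =~ $parser) {
    # %/ is the result hash Regexp::Grammars fills in on a successful match;
    # <[line]> collects every match of the line token into an array ref.
    @lines = @{ $/{data}{line} };
}

# Brackets make any stray leading or trailing whitespace visible.
print "[$_]\n" for @lines;
```

Because line is a token, its .+ starts exactly where the separator ended, so the newline should be consumed by the separator rather than ending up in the captured text.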

Solution

  • This works:

    qr{
      <data>
      <rule: data>  <[text]>+ % [\r\n]+
      <rule: text>  .+
    }xm;
    

    The lines of data are separated by one or more EOL characters, which is what the [\r\n]+ separator specifies. Note: Windows files typically end each line with both a carriage return \r and a line feed \n, hence the [\r\n]+ pattern. You can read more about this by running perldoc Regexp::Grammars and searching for "separator".
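
One thing to watch with Windows line endings: in a Perl regex, . matches \r, so a greedy .+ can keep a trailing \r in the captured text. The following is a hedged sketch of a variant that avoids this, not the answer's grammar. It assumes Regexp::Grammars is installed; the eol token and the names $parser, $input and @lines are hypothetical, introduced only for illustration:

```perl
use strict;
use warnings;
use Regexp::Grammars;   # CPAN module; assumed to be installed

# text stops before either EOL character, and eol accepts a Unix (\n)
# or a Windows (\r\n) line ending. <.eol> matches without capturing.
my $parser = qr{
    <data>
    <token: data>  <[text]>+ % <.eol>
    <token: text>  [^\r\n]+
    <token: eol>   \r?\n
}xm;

# Illustrative input mixing both line-ending styles.
my $input = "line_1\r\nline_2\nline_3";

my @lines;
if ($input =~ $parser) {
    @lines = @{ $/{data}{text} };
}

# Brackets make a stray \r or \n in a captured line easy to spot.
print "[$_]\n" for @lines;
```

Unlike [\r\n]+, the \r?\n separator matches exactly one line ending, so an empty line in the input would end the list instead of being skipped. Which behaviour is preferable depends on the data.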