
Parse a single-quoted string using Marpa::R2 in Perl


How can I parse a single-quoted string using Marpa::R2? With the code below, a single-quoted string comes back from the parse with '\' added to it.

Code:

use strict;
use Marpa::R2;
use Data::Dumper;


my $grammar = Marpa::R2::Scanless::G->new(
   {  default_action => '[values]',
      source         => \(<<'END_OF_SOURCE'),
  lexeme default = latm => 1

:start ::= Expression

# include begin

Expression ::= Param
Param ::= Unquoted                                         
        | ('"') Quoted ('"') 
        | (') Quoted (')

:discard      ~ whitespace 
whitespace    ~ [\s]+

Unquoted      ~ [^\s\/\(\),&:\"~]+
Quoted        ~ [^\s&:\"~]+

END_OF_SOURCE
   });

my $input1 = 'foo';
#my $input2 = '"foo"';
#my $input3 = '\'foo\'';

my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar });

print "Trying to parse:\n$input1\n\n";
$recce->read(\$input1);
my $value_ref = ${$recce->value};
print "Output:\n".Dumper($value_ref);

Outputs:

Trying to parse:
foo

Output:
$VAR1 = [
          [
            'foo'
          ]
        ];

Trying to parse:
"foo"

Output:
$VAR1 = [
          [
            'foo'
          ]
        ];

Trying to parse:
'foo'

Output:
$VAR1 = [
          [
            '\'foo\''
          ]
        ]; (don't want it to be parsed like this)

Above are the outputs for all three inputs. I don't want the third result to contain the '\' characters and the single quotes; I want it to be parsed like the second output. Please advise.

Ideally, it should just pick the content between the single quotes, according to Param ::= (') Quoted (')


Solution

  • The other answer regarding Data::Dumper output is correct. However, your grammar does not work the way you expect it to.
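The other answer is not quoted here; presumably its point is that the backslashes come from how Data::Dumper prints a string, not from the parser. A minimal sketch to check that in isolation, using only core Perl and no Marpa (the variable `$value` is a stand-in for what the parse returned):

```perl
use strict;
use warnings;
use Data::Dumper;

# Stand-in for the parse result: five characters, quotes included.
my $value = q{'foo'};

print length($value), "\n";   # counts the characters actually stored
print Dumper($value);         # Dumper escapes the quotes for display only
print "$value\n";             # the raw string contains no backslash
```

So the string holds the single quotes but no backslashes. The quotes themselves are there because of the grammar, which is what the rest of this answer addresses.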

    When you parse the input 'foo', Marpa will consider the three Param alternatives. The predicted lexemes at that position are:

    • Unquoted ~ [^\s\/\(\),&:\"~]+
    • '"'
    • ') Quoted ('

    Yes, the last is literally ) Quoted (, not anything containing a single quote.

    Even if the rule were written as ([']) Quoted ([']), the Unquoted lexeme would still match the entire input, single quotes included: its character class does not exclude the single quote, and longest token matching prefers the longest lexeme.
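To see why Unquoted swallows the whole input, here is a rough illustration using a plain Perl regex as a stand-in for Marpa's lexer. This is an approximation, not how Marpa matches internally; only the character class is taken from the question, and `$unquoted` and `$token` are names made up for the sketch:

```perl
use strict;
use warnings;

# Character class copied from the question's Unquoted lexeme.
# A Perl regex is used here only as a stand-in for Marpa's lexer.
my $unquoted = qr/[^\s\/\(\),&:\"~]+/;

my $input = q{'foo'};

# Greedy match from the start of the input, as a longest-match lexer would do.
my ($token) = $input =~ /\A($unquoted)/;

print "matched: $token\n";
print "consumed whole input: ", ($token eq $input ? "yes" : "no"), "\n";
```

Because the single quote is not in the excluded set, the match runs from the opening quote to the closing one, and no other lexeme can be longer.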

    What would happen for an input like " foo " (with double quotes)? Here only the '"' lexeme would match at the start; then any whitespace would be discarded, the Quoted lexeme would match, any whitespace would be discarded again, and finally the closing " would be matched.

    To prevent this whitespace-skipping behaviour and to prevent the Unquoted rule from being preferred due to LATM, it makes sense to describe quoted strings as lexemes. For example:

    Param ::= Unquoted | Quoted
    Unquoted ~ [^'"]+
    Quoted ~ DQ | SQ
    DQ ~ '"' DQ_Body '"'
    DQ_Body ~ [^"]*
    SQ ~ ['] SQ_Body [']
    SQ_Body ~ [^']*
    

    These lexemes will then include the quote delimiters and any escapes, so you need to post-process the lexeme contents. You can do this either with the event system (conceptually clean, but a bit cumbersome to implement) or by adding an action that performs the processing during parse evaluation.

    Since lexemes cannot have actions, it is usually best to add a proxy production:

    Param ::= Unquoted | Quoted
    Unquoted ~ [^'"]+
    Quoted ::= Quoted_Lexeme action => process_quoted
    Quoted_Lexeme ~ DQ | SQ
    DQ ~ '"' DQ_Body '"'
    DQ_Body ~ [^"]*
    SQ ~ ['] SQ_Body [']
    SQ_Body ~ [^']*
    

    The action could then do something like:

    sub process_quoted {
      my (undef, $s) = @_;
      # remove delimiters from double-quoted string
      return $1 if $s =~ /^"(.*)"$/s;
      # remove delimiters from single-quoted string
      return $1 if $s =~ /^'(.*)'$/s;
      die "String was not delimited with single or double quotes";
    }
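The action can be exercised on its own, without Marpa, to check what it returns for each kind of lexeme. The sketch below repeats the sub so that it is self-contained and calls it the way an action would be called, with a placeholder (`undef`) for the per-parse object as the first argument; the sample inputs are made up for illustration:

```perl
use strict;
use warnings;

# Same logic as the action above. The first argument (the per-parse
# object) is ignored; the second is the text of the quoted lexeme.
sub process_quoted {
    my (undef, $s) = @_;
    return $1 if $s =~ /^"(.*)"$/s;   # strip double-quote delimiters
    return $1 if $s =~ /^'(.*)'$/s;   # strip single-quote delimiters
    die "String was not delimited with single or double quotes";
}

# Hypothetical lexeme values, as Quoted_Lexeme would deliver them.
for my $lexeme (q{'foo'}, q{"foo bar"}, q{"it's"}) {
    my $content = process_quoted(undef, $lexeme);
    print "$lexeme => $content\n";
}

# Text without delimiters is rejected with an exception.
my $ok = eval { process_quoted(undef, 'foo'); 1 };
print "undelimited input rejected\n" if !$ok;
```

When wiring this into the grammar, the action name has to resolve to a sub that Marpa can find. One way, assuming the sub lives in package main, is to pass semantics_package => 'main' when constructing the recognizer.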