erlang leex yecc

Proper way to parse multiple items


I have an input file with multiple lines, each containing fields separated by spaces. My definition files are:

scanner.xrl:

Definitions.

DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]

Rules.

(\s|\t)+ : skip_token.
\n : {end_token, {new_line, TokenLine}}.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.

Erlang code.

parser.yrl:

Nonterminals line.

Terminals string.

Rootsymbol line.

Endsymbol new_line.

line -> string : ['$1'].
line -> string line: ['$1'|'$2'].

Erlang code.

When I run it as is, the first line is parsed and then parsing stops:

1> A = <<"a b c\nd e\nf\n">>.
<<"a b c\nd e\nf\n">>
2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
     {string,1,"b"},
     {string,1,"c"},
     {new_line,1},
     {string,2,"d"},
     {string,2,"e"},
     {new_line,2},
     {string,3,"f"},
     {new_line,3}],
    4}
3> parser:parse(T).
{ok,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]}

If I remove the Endsymbol line from parser.yrl and change the scanner.xrl file as follows:

Definitions.

DIGIT = [0-9]
ALPHANUM = [0-9a-zA-Z_]

Rules.

(\s|\t|\n)+ : skip_token.
{ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.

Erlang code.

All my lines are parsed as a single item:

1> A = <<"a b c\nd e\nf\n">>.
<<"a b c\nd e\nf\n">>
2> {ok, T, _} = scanner:string(binary_to_list(A)).
{ok,[{string,1,"a"},
     {string,1,"b"},
     {string,1,"c"},
     {string,2,"d"},
     {string,2,"e"},
     {string,3,"f"}],
    4}
3> parser:parse(T).
{ok,[{string,1,"a"},
     {string,1,"b"},
     {string,1,"c"},
     {string,2,"d"},
     {string,2,"e"},
     {string,3,"f"}]}

What would be the proper way to signal to the parser that each line should be treated as a separate item? I would like my result to look something like:

{ok,[[{string,1,"a"},
     {string,1,"b"},
     {string,1,"c"}],
     [{string,2,"d"},
     {string,2,"e"}],
     [{string,3,"f"}]]}

Solution

  • Here is one lexer/parser pair that does the job. It produces a single shift/reduce conflict, but I think it will solve your problem; you only need to clean up the tokens as you prefer.

    I'm pretty sure there are easier and faster ways to do it, but during my "lexer fighting times" it was so hard to find any information at all that I hope this gives you an idea of how to proceed with parsing in Erlang.

    scanner.xrl

    Definitions.
    
    DIGIT = [0-9]
    ALPHANUM = [0-9a-zA-Z_]
    
    Rules.
    
    (\s|\t)+ : skip_token.
    \n : {token, {line, TokenLine}}.
    {ALPHANUM}+ : {token, {string, TokenLine, TokenChars}}.
    
    Erlang code.
    

    parser.yrl

    Nonterminals 
        Lines
        Line
        Strings.
    
    Terminals string line.
    
    Rootsymbol Lines.
    
    Lines -> Line Lines : lists:flatten(['$1', '$2']).
    Lines -> Line : lists:flatten(['$1']).
    
    Line -> Strings line : {line, lists:flatten(['$1'])}.
    Line -> Strings : {line, lists:flatten(['$1'])}.
    
    Strings -> string Strings : lists:append(['$1'], '$2').
    Strings -> string : lists:flatten(['$1']).
    
    Erlang code.
    

    output

    {ok,[{line,[{string,1,"a"},{string,1,"b"},{string,1,"c"}]},
         {line,[{string,2,"d"},{string,2,"e"}]},
         {line,[{string,3,"f"}]}]}
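
    If the {line, ...} wrapper is not wanted, one possible cleanup step is to unwrap each line after parsing. This is only a sketch, assuming the parser returns exactly the shape shown above; the module name line_cleanup and the function to_nested/1 are made up for illustration and are not part of the generated parser.

    ```erlang
    -module(line_cleanup).
    -export([to_nested/1]).

    %% Drop the {line, ...} wrapper from every parsed line, keeping the
    %% order of the lines and of the tokens inside each line.
    to_nested({ok, Lines}) ->
        {ok, [Strings || {line, Strings} <- Lines]}.
    ```

    Calling line_cleanup:to_nested(parser:parse(T)) then gives one inner list per input line, which is the nested shape asked for in the question.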
    

    The parser flow is the following:

    • The root symbol is the nonterminal "Lines"
    • "Lines" consists of "Line + Lines" or simply "Line", which gives the looping
    • "Line" consists of "Strings + line", or just "Strings" when the last line ends at the end of the file
    • "Strings" consists of a single 'string', or "'string' + Strings" when there are several strings
    • 'line' is the token the scanner emits for '\n'
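
    To try the flow above end to end, the two definition files have to be turned into Erlang modules first. The following is a minimal sketch, assuming scanner.xrl and parser.yrl sit in the current directory; the module name parse_lines_demo and the function run/1 are hypothetical helpers, while leex:file/1, yecc:file/1, compile:file/1 and code:load_file/1 are the standard OTP calls.

    ```erlang
    -module(parse_lines_demo).
    -export([run/1]).

    %% Generate, compile and load the scanner and parser, then parse Input.
    %% Input is a plain string (list of characters), e.g. "a b c\nd e\nf\n".
    run(Input) ->
        {ok, ScannerFile} = leex:file("scanner.xrl"),  % writes scanner.erl
        {ok, ParserFile}  = yecc:file("parser.yrl"),   % writes parser.erl; the
                                                       % conflict is only a warning
        {ok, scanner} = compile:file(ScannerFile),
        {ok, parser}  = compile:file(ParserFile),
        %% Purge any previously loaded version so the reload cannot fail.
        code:purge(scanner),
        code:purge(parser),
        {module, scanner} = code:load_file(scanner),
        {module, parser}  = code:load_file(parser),
        {ok, Tokens, _EndLine} = scanner:string(Input),
        parser:parse(Tokens).
    ```

    With the definition files from this answer, parse_lines_demo:run("a b c\nd e\nf\n") should return one {line, ...} entry per input line.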

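    Please allow me to comment on a few issues I found in the original code.

    • Treat the whole file as a nested list of lines rather than parsing it line by line; this is why the Lines/Line nonterminals are introduced. With "Endsymbol new_line", the parser takes the first newline as the end of the input, which is why only the first line came back.
    • "Terminals" are the tokens produced by the scanner; the parser does not break them down any further. "Nonterminals" are composite symbols that the grammar rules expand into other symbols.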
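
    The "nested list" idea can also be expressed directly in the grammar. The variant below is a sketch rather than a tested replacement: it keeps the symbol names and the scanner.xrl from this answer, builds plain lists instead of {line, ...} tuples, and assumes that every line, including the last one, ends with '\n' and that there are no empty lines. Under those assumptions each lookahead token decides between shifting and reducing, so no conflict should be reported.

    ```erlang
    Nonterminals Lines Line Strings.

    Terminals string line.

    Rootsymbol Lines.

    %% A file is one or more lines, collected into an outer list.
    Lines -> Line : ['$1'].
    Lines -> Line Lines : ['$1' | '$2'].

    %% A line is its strings followed by the newline token, which is dropped.
    Line -> Strings line : '$1'.

    %% The strings of one line, collected into an inner list.
    Strings -> string : ['$1'].
    Strings -> string Strings : ['$1' | '$2'].

    Erlang code.
    ```

    The trade-off is strictness: input without a trailing newline, or with blank lines, would be rejected by this variant, whereas the grammar above accepts a final line without '\n' at the cost of the shift/reduce conflict.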