Search code examples
parsingprologlogicdcgdeterministic

Is it appropriate for a parser DCG to not be deterministic?


I am writing a parser for a query engine. My parser DCG query is not deterministic.

I will be using the parser in a relational manner, to both check and synthesize queries.

Is it appropriate for a parser DCG to not be deterministic?

In code:

If I want to be able to use query/2 both ways, does it require that

?- phrase(query, [q,u,e,r,y]).
true;
false.

or should I be able to obtain

?- phrase(query, [q,u,e,r,y]).
true.

nevertheless, given that the first snippet would require me to use it as such

?- bagof(X, phrase(query, [q,u,e,r,y]), [true]).
true.

when using it to check a formula?


Solution

  • The first question to ask yourself, is your grammar deterministic, or in the terminology of grammars, unambiguous. This is not asking if your DCG is deterministic, but if the grammar is unambiguous. That can be answered with basic parsing concepts, no use of DCG is needed to answer that question. In other words, is there only one way to parse a valid input. The standard book for this is "Compilers : principles, techniques, & tools" (WorldCat)

    Now you are actually asking about three different uses for parsing.

    1. A recognizer.
    2. A parser.
    3. A generator.

    If your grammar is unambiguous then

    1. For a recognizer the answer should only be true for valid input that can be parsed and false for invalid input.
    2. For the parser it should be deterministic as there is only one way to parse the input. The difference between a parser and an recognizer is that a recognizer only returns true or false and a parser will return something more, typically an abstract syntax tree.
    3. For the generator, it should be semi-deterministic so that it can generate multiple results.

    Can all of this be done with one, DCG, yes. The three different ways are dependent upon how you use the input and output of the DCG.


    Here is an example with a very simple grammar.

    The grammar is just an infix binary expression with one operator and two possible operands. The operator is (+) and the operands are either (1) or (2).

    expr(expr(Operand_1,Operator,Operand_2)) -->
        operand(Operand_1),
        operator(Operator),
        operand(Operand_2).
    
    operand(operand(1)) --> "1".
    operand(operand(2)) --> "2".
    
    operator(operator(+)) --> "+".
    
    recognizer(Input) :-
        string_codes(Input,Codes),
        DCG = expr(_),
        phrase(DCG,Codes,[]).
    
    parser(Input,Ast) :-
        string_codes(Input,Codes),
        DCG = expr(Ast),
        phrase(DCG,Codes,[]).
    
    generator(Generated) :-
        DCG = expr(_),
        phrase(DCG,Codes,[]),
        string_codes(Generated,Codes).
    
    :- begin_tests(expr).
    
    recognizer_test_case_success("1+1").
    recognizer_test_case_success("1+2").
    recognizer_test_case_success("2+1").
    recognizer_test_case_success("2+2").
    
    test(recognizer,[ forall(recognizer_test_case_success(Input)) ] ) :-
        recognizer(Input).
    
    recognizer_test_case_fail("2+3").
    
    test(recognizer,[ forall(recognizer_test_case_fail(Input)), fail ] ) :-
        recognizer(Input).
    
    parser_test_case_success("1+1",expr(operand(1),operator(+),operand(1))).
    parser_test_case_success("1+2",expr(operand(1),operator(+),operand(2))).
    parser_test_case_success("2+1",expr(operand(2),operator(+),operand(1))).
    parser_test_case_success("2+2",expr(operand(2),operator(+),operand(2))).
    
    test(parser,[ forall(parser_test_case_success(Input,Expected_ast)) ] ) :-
        parser(Input,Ast),
        assertion( Ast == Expected_ast).
    
    parser_test_case_fail("2+3").
    
    test(parser,[ forall(parser_test_case_fail(Input)), fail ] ) :-
        parser(Input,_).
    
    test(generator,all(Generated == ["1+1","1+2","2+1","2+2"]) ) :-
        generator(Generated).
    
    :- end_tests(expr).
    

    The grammar is unambiguous and has only 4 valid strings which are all unique.

    The recognizer is deterministic and only returns true or false.
    The parser is deterministic and returns a unique AST.
    The generator is semi-deterministic and returns all 4 valid unique strings.

    Example run of the test cases.

    ?- run_tests.
    % PL-Unit: expr ........... done
    % All 11 tests passed
    true.
    

    To expand a little on the comment by Daniel

    As Daniel notes

    1 + 2 + 3 
    

    can be parsed as

    (1 + 2) + 3 
    

    or

    1 + (2 + 3)
    

    So 1+2+3 is an example as you said is specified by a recursive DCG and as I noted a common way out of the problem is to use parenthesizes to start a new context. What is meant by starting a new context is that it is like getting a new clean slate to start over again. If you are creating an AST, you just put the new context, items in between the parenthesizes, as a new subtree at the current node.

    With regards to write_canonical/1, this is also helpful but be aware of left and right associativity of operators. See Associative property

    e.g.

    + is left associative

    ?- write_canonical(1+2+3).
    +(+(1,2),3)
    true.
    

    ^ is right associative

    ?- write_canonical(2^3^4).
    ^(2,^(3,4))
    true.
    

    i.e.

    2^3^4 = 2^(3^4) = 2^81 = 2417851639229258349412352
    
    2^3^4 != (2^3)^4 = 8^4 = 4096
    

    The point of this added info is to warn you that grammar design is full of hidden pitfalls and if you have not had a rigorous class in it and done some of it you could easily create a grammar that looks great and works great and then years latter is found to have a serious problem. While Python was not ambiguous AFAIK, it did have grammar issues, it had enough issues that when Python 3 was created, many of the issues were fixed. So Python 3 is not backward compatible with Python 2 (differences). Yes they have made changes and libraries to make it easier to use Python 2 code with Python 3, but the point is that the grammar could have used a bit more analysis when designed.