Search code examples
pythonparsingply

PLY resolving ambiguity for simple sentence parser


I am having an issue where I need to parse very simple sentences into BNF parses. I am able to label the tokens easily and have simple sentences working. However, there are several situations where either a reduce rule can be applied or a shift can happen and another reduce could potentially happen. For instance:

Jared chased Tom and Jerry ate.

I have one rule that reduces noun conjunction noun to a noun phrase, and another rule that reduces sentence conjunction sentence to a compound sentence. At "Jerry" it seems like it automatically assumes that the sentence is of the form noun verb noun phrase, as opposed to sentence conjunction sentence, and throws an error when it encounters "ate" as there is no rewrite rule for NP VP NP VP.

How can I change this so that IF the words to the right of the conjunction are a sentence, it parses it as S CONJ S, but if it is not, then it is parsed as NP VP NP?

Edit: for clarification, here is my current code:

import ply.yacc as yacc
from lexer import tokens

precedence = (('left', 'Vi'),('left', 'N'))

def p_S(p):
    """S : NP VP"""
    p[0] = '[S ' + p[1] + ' ' + p[2] + ' ]'

def p_VP(p):
    """VP : VP Conj VP"""
    p[0] = '[VP ' + p[1] + ' ' + p[2] + ' ' + p[3] + ' ]'

def p_Vi(p):
    """VP : Vi"""
    p[0] = '[VP [VI ' + p[1] + ' ] ]'

def p_Vt(p):
    """VP : Vt NP"""
    p[0] = '[VP [Vt ' + p[1] + ' ] ' + p[2] + ' ]'

def p_Vd(p):
    """VP : Vd NP NP"""
    p[0] = '[VP [VT ' + p[1] + ' ] ' + p[2] + ' ' + p[3] + ' ]'

def p_NP(p):
    """NP : NP Conj NP
    | N"""
    if len(p) == 4:
        p[0] = '[NP ' + p[1] + ' [Conj ' + p[2] + ' ] ' + p[3] + ' ]'
    else:
        p[0] = '[N ' + p[1] + ' ]'

def p_error(p):
    print("Syntax Error in input!\n")

parser = yacc.yacc()
while True:

    try:
        s = input('calc > ')
        if not s: continue
        result = parser.parse(s)
        if result is not None: print(result + '\n')
    except EOFError:
        print("Please try again")

Solution

  • You can't, at least not with a deterministic single-token lookahead parser, which is what Ply builds.

    Fortunately (or not) the human brain is not limited to Knuthian left-to-right parsing, although the details of how we parse sentences are not entirely clear. What's more, most human languages are actually ambiguous and only semantic analysis can distinguish between the multiple possible parses. (In spoken language, there are also a number of other language features, such as intonation, inter-word spacing, and emphasis, which also help guide the parse. These features are not usually transcribed in written language, but written language can be processed non-linearly; the eye can reread or scan ahead if necessary.)

    For human language parsing, GLR or similar algorithms will prove to be more useful. Although Bison can produce GLR parsers, that feature has not yet been implented in Ply, as far as I know. Chart parsing is not particularly complicated; you could probably code your own without too much work. However, unless you are interested in parsing algorithms, there is not much point, since there are existing packages for human language processing in Python.

    (SO discourages library recommendations, but I believe that NLTK, the natural language toolkit, is probably the most widely-used Python framework for human language processing.)