
PLY token priority issue


I'm using PLY to lex and parse some .tex files. For some reason, token priority does not work as described in the documentation.

Here are the tokens and the states:

tokens = ('BT', 'BL', 'BD', 'BCONJ', 'BCOR', 'BE', 'ET', 'EL', 'ED', 'ECONJ', 'ECOR', 'EE', 'SEC', 'SSEC', 'SSSEC', 'ES', 'TEXT','ITEXT','BIBS','MT',)

states = (('ig', 'exclusive'), ('sec', 'exclusive'))

Here are the functions used by the lexer:

def t_ig_BT(t):
    r'\\begin\{theorem\}'
    t.lexer.begin('INITIAL')
    return t

def t_ig_BL(t):
    r'\\begin\{lemma\}'
    t.lexer.begin('INITIAL')
    return t

def t_ig_BD(t):
    r'\\begin\{definition\}'
    t.lexer.begin('INITIAL')
    return t

def t_ig_BCONJ(t):
    r'\\begin\{conjecture\}'
    t.lexer.begin('INITIAL')
    return t

def t_ig_BCOR(t):
    r'\\begin\{corollary\}'
    t.lexer.begin('INITIAL')
    return t

def t_ig_BE(t):
    r'\\begin\{example\}'
    t.lexer.begin('INITIAL')
    return t

def t_ET(t):
    r'\\end\{theorem\}'
    t.lexer.begin('ig')
    return t

def t_EL(t):
    r'\\end\{lemma\}'
    t.lexer.begin('ig')
    return t

def t_ED(t):
    r'\\end\{definition\}'
    t.lexer.begin('ig')
    return t

def t_ECONJ(t):
    r'\\end\{conjecture\}'
    t.lexer.begin('ig')
    return t

def t_ECOR(t):
    r'\\end\{corollary\}'
    t.lexer.begin('ig')
    return t

def t_EE(t):
    r'\\end\{example\}'
    t.lexer.begin('ig')
    return t

def t_INITIAL_ig_SEC(t):
    r'\\section\{'
    t.lexer.begin('sec')
    return t

def t_ig_SSEC(t):
    r'\\subsection\{'
    t.lexer.begin('sec')
    return t

def t_ig_SSSEC(t):
    r'\\subsubsection\{'
    t.lexer.begin('sec')
    return t

def t_sec_ES(t):
    r'\}'
    t.lexer.begin('ig')
    return t

def t_ig_BIBS(t):
    r'\\bibliographystyle'
    t.lexer.begin('INITIAL')
    return t

def t_INITIAL_MT(t):
    r'\\maketitle'
    t.lexer.begin('ig')
    return t

def t_INITIAL_sec_TEXT(t):
    r'[\s\S]+'
    return t

def t_ig_ITEXT(t):
    r'[\s\S]+'
    pass

def t_ANY_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

The program is supposed to detect the beginning, the end, sections, subsections, subsubsections, theorems, lemmas, definitions, conjectures, corollaries and examples in a math paper and ignore the rest of the contents to produce a summary.

At the start, the lexer is supposed to retain all characters until it reaches the token MT. At that point it should keep the token and enter the ig state. There it should ignore all characters unless it detects a theorem/lemma/definition/conjecture/corollary/example, in which case it temporarily enters the INITIAL state and retains the contents, or a (sub/subsub)section, in which case it temporarily enters the sec state.

Right now it seems that in the INITIAL state the entire file is matched as a single TEXT token, whereas I do not want TEXT to match as much as possible.


Solution

  • OK, I think I know what's wrong. When r'[\s\S]+' is matched, it matches everything it can, which here is the entire file. I changed the definition of TEXT to r'[\s\S]' (one character per token) and changed the parser accordingly, which works.
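
The effect can be reproduced with plain Python `re`, without PLY. This is only a rough sketch. It assumes the lexer tries its function rules in definition order at the current position and takes the first one that matches, which is how Python's regex alternation behaves. The `scan` helper, the two-rule master patterns and the sample string are made up for illustration and are not part of the code above. The third pattern, which consumes a run of non-backslash characters at a time, is a hypothetical alternative to the one-character fix, not something the original solution used.

```python
import re

# Simplified stand-ins for a combined lexer pattern (assumption: rules are
# tried in this order at the current position, first match wins).
MASTER_GREEDY = re.compile(r'(?P<MT>\\maketitle)|(?P<TEXT>[\s\S]+)')
MASTER_SINGLE = re.compile(r'(?P<MT>\\maketitle)|(?P<TEXT>[\s\S])')
# Hypothetical variant: a run of non-backslash characters, or one backslash.
MASTER_CHUNKED = re.compile(r'(?P<MT>\\maketitle)|(?P<TEXT>[^\\]+|\\)')


def scan(master, data):
    """Repeatedly match `master` at the current position and collect tokens."""
    pos = 0
    tokens = []
    while pos < len(data):
        m = master.match(data, pos)  # always matches: TEXT accepts any character
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


sample = 'ab\\maketitle c'

# MT cannot match at position 0, so TEXT is tried there and its greedy
# [\s\S]+ runs to the end of the input, swallowing \maketitle as well.
greedy_tokens = scan(MASTER_GREEDY, sample)

# With one character per TEXT token, MT gets a chance at every position.
single_tokens = scan(MASTER_SINGLE, sample)

# TEXT stops in front of each backslash, so MT is still tried there.
chunked_tokens = scan(MASTER_CHUNKED, sample)

print(greedy_tokens)
print([kind for kind, _ in single_tokens])
print(chunked_tokens)
```

The one-character version yields many small TEXT tokens that the parser then has to glue back together, which is presumably why the parser had to change as well. The chunked variant would produce fewer tokens, at the cost of a slightly less obvious pattern.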