Tags: python, parsing, token, ply

How to have a "default" token with PLY?


I have some text to parse that contains a fair amount of content that is irrelevant to the parsing. For this reason I would like to be able to tokenize as "TEXT" anything that does not follow the specific patterns I am looking for.

For example, let's say I am looking for the sequences "HELP!" and "OVER HERE!". I would like the sequence "some random text HELP! lorem ipsum" to be tokenized as: (TEXT, 'some random text '), (HELP, 'HELP!'), (TEXT, ' lorem ipsum').

If I do that:

import ply.lex as lex


tokens = (
    'TEXT',
    'SIGNAL1',
    'SIGNAL2'
)

t_SIGNAL1 = "HELP!"
t_SIGNAL2 = "OVER HERE!"

t_TEXT = r'[\s\S]+'

data = "some random text HELP! lorem ipsum"
lexer = lex.lex()
lexer.input(data)
while True:
    tok = lexer.token()
    if not tok:
        break  # No more input
    print(tok)

It fails, of course, because the TEXT token grabs the whole text. I could change the regex for t_TEXT into something fancier, but as I have a good dozen different specific sequences I want to capture, it would be completely unreadable.

I feel like there should be an easy solution for that, but can't figure one out.


Solution

  • Ply's lexer tries patterns in a fixed order, which can be exploited to define a default token. But there are a couple of downsides to this approach.

    The defined order is:

    1. Ignored characters, from the definition of t_ignore.

    2. Tokens matched by a token function, in the order the functions are defined.

    3. Tokens matched by a token variable, in order of decreasing regular expression length (longest pattern first).

    4. Literal characters, from the definition of literals.

    5. The t_error function, which is called if none of the above match.

    Aside from t_error, all of the above are conditional on some pattern matching at least one character at the current input point. So the only reliable fallback (default) pattern would be one which matches any single character: (?s:.) (or just ., if you're willing to globally set the re.S flag). That could be used as the last token function, provided you don't use any token variables nor literal characters, or it could be used as a token variable, provided that it is shorter than any other token variable pattern, and that you don't use literal characters. (That might be easier if you could use ., but you'd still need to ensure that no other variable's pattern had a length of 1.)
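
    As a minimal sketch of that first variant for the question's tokens, the islands can be written as token functions so that the one-character fallback function is reliably tried last (the escaped space in OVER\ HERE! is there because Ply compiles token patterns with re.VERBOSE, which otherwise ignores literal whitespace):

        import ply.lex as lex

        tokens = ('TEXT', 'SIGNAL1', 'SIGNAL2')

        # The islands are token functions so that the catch-all below is
        # reliably tried last (function rules are matched in definition
        # order, ahead of any token variables).
        def t_SIGNAL1(t):
            r'HELP!'
            return t

        # Space escaped because Ply compiles patterns with re.VERBOSE.
        def t_SIGNAL2(t):
            r'OVER\ HERE!'
            return t

        # Fallback: exactly one character of anything else.
        def t_TEXT(t):
            r'(?s:.)'
            return t

        def t_error(t):
            t.lexer.skip(1)

        lexer = lex.lex()
        lexer.input("some random text HELP! lorem ipsum")
        for tok in iter(lexer.token, None):
            print(tok)

    Every character of the sea comes back as its own one-character TEXT token, which is exactly the consolidation problem discussed next.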

    The main problem with this approach (other than the inefficiencies it creates) is that the default token is just one character long. In order to implement something like your TEXT token, consisting of the entire sea between the islands you want to parse, you would need to consolidate sequences of consecutive TEXT tokens into a single token. This could be done reasonably easily in your parser, or it could be done using a wrapper around the lexical scanner, but either way it's an additional complication.
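
    One possible shape for such a wrapper, written here as a plain generator (merged_tokens is my own name, not a Ply facility, and the usage reuses the lexer built in the sketch above), glues runs of consecutive TEXT tokens back together:

        def merged_tokens(lexer):
            # Generator over lexer.token() that merges consecutive TEXT tokens.
            pending = None
            for tok in iter(lexer.token, None):
                if tok.type == 'TEXT':
                    if pending is None:
                        pending = tok
                    else:
                        pending.value += tok.value   # extend the current run
                    continue
                if pending is not None:
                    yield pending
                    pending = None
                yield tok
            if pending is not None:
                yield pending

        # Using the lexer from the previous sketch:
        lexer.input("some random text HELP! lorem ipsum")
        for tok in merged_tokens(lexer):
            print(tok)

    If the merged stream has to feed a Ply parser, the same logic can live in a small class whose token() method returns the next merged token, since yacc simply calls token() on whatever lexer object it is given.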

    Alternatively, you could use t_error as a fallback. t_error is only called if nothing else matches, and if t_error returns a token, the Ply lexer will use that token. So in some ways, t_error is the ideal fallback. (But note that Ply does consider t_error to indicate an error. For example, it will be logged as an error, if you've enabled debug logging.)

    The advantage to this approach is that the t_error function can absorb as many input characters as desired, using whatever mechanism you consider most appropriate. In fact, it must do this, by explicitly incrementing the value of t.lexer.lexpos (which is what the skip method does); otherwise, an exception will be raised.
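
    For instance, one possible mechanism (the ISLANDS helper below is my own choice, not something Ply prescribes) is to search for the next island and hand everything before it back as a TEXT token:

        import re
        import ply.lex as lex

        tokens = ('TEXT', 'SIGNAL1', 'SIGNAL2')

        # Space escaped because Ply compiles patterns with re.VERBOSE.
        t_SIGNAL1 = r'HELP!'
        t_SIGNAL2 = r'OVER\ HERE!'

        # Helper regex used only to decide how far the fallback should reach.
        ISLANDS = re.compile('|'.join(f'(?:{p})' for p in (t_SIGNAL1, t_SIGNAL2)))

        def t_error(t):
            # Nothing matched at this point, so everything up to the next
            # island (or the end of the input) is plain text.
            m = ISLANDS.search(t.lexer.lexdata, t.lexer.lexpos)
            end = m.start() if m else len(t.lexer.lexdata)
            t.type = 'TEXT'
            t.value = t.lexer.lexdata[t.lexer.lexpos:end]
            t.lexer.lexpos = end   # same effect as t.lexer.skip(len(t.value))
            return t

        lexer = lex.lex()
        lexer.input("some random text HELP! lorem ipsum")
        for tok in iter(lexer.token, None):
            print(tok)

    Run on the sample input, this produces the three desired tokens even though no t_TEXT rule is defined; the TEXT name only has to appear in the tokens tuple.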

    But there's a problem: before calling t_error(t), the lexer sets t.value to (a copy of) the input string starting at the current input point. If t_error is called frequently, the cost of these copies could add up, possibly even turning the parse from linear time to quadratic time.

    That doesn't free you from the problem of figuring out what the extent of the fallback token should be. As mentioned, t_error is not limited to the use of a precise regular expression, but it's not always obvious what other mechanism could be used.

    So that brings us to the third possibility, which is to construct a regular expression which actually matches the text between useful tokens.

    In most cases, that can actually be done mechanically, given that all the token patterns are available, either as the value of specific member variables or as the docstring of specific member functions. If you have this list of regular expression strings, you can create a pattern which will match the text up to the first such match, using a lookahead assertion:

        # This leaves out the construction of the list of patterns;
        # TOKEN is imported from ply.lex.
        alternatives = '|'.join(f'(?:{p})' for p in patterns)
        # One or more characters, never starting where another pattern
        # matches, up to the next such match (or the end of the input).
        @TOKEN(rf"(?s:(?!{alternatives}).+?(?=(?:{alternatives})|\Z))")
        def t_TEXT(t):
            return t
    

    Note that patterns must also include a pattern which matches the character sets t_ignore and literals.
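
    As a rough sketch of how that could look for the question's two signals, with the pattern list simply written out by hand (and the space in OVER\ HERE! escaped, again because Ply compiles token patterns with re.VERBOSE):

        import ply.lex as lex
        from ply.lex import TOKEN

        tokens = ('TEXT', 'SIGNAL1', 'SIGNAL2')

        # Space escaped because of re.VERBOSE.
        t_SIGNAL1 = r'HELP!'
        t_SIGNAL2 = r'OVER\ HERE!'

        # Every pattern the sea must stop in front of; if t_ignore or
        # literals were used, character classes covering them would
        # have to be added here too.
        patterns = [t_SIGNAL1, t_SIGNAL2]
        alternatives = '|'.join(f'(?:{p})' for p in patterns)

        # TEXT: at least one character, never starting where an island
        # starts, running up to the next island or the end of the input.
        @TOKEN(rf"(?s:(?!{alternatives}).+?(?=(?:{alternatives})|\Z))")
        def t_TEXT(t):
            return t

        def t_error(t):
            t.lexer.skip(1)

        lexer = lex.lex()
        lexer.input("some random text HELP! lorem ipsum")
        for tok in iter(lexer.token, None):
            print(tok)

    On the sample input this yields exactly the tokenization the question asks for: (TEXT, 'some random text '), (SIGNAL1, 'HELP!'), (TEXT, ' lorem ipsum').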