Search code examples
pythonply

How is PLY's parsetab.py formatted?


I'm working on a project to convert MATLAB code to Python, and have been somewhat successful after building off others work. The tool uses PLY (an implementation of lex and yacc parsing tools for Python) to parse the MATLAB input. Unfortunately, it is a requirement that my code is written in Python 3, not Python 2. The tool runs without issue in Python 2, but I get a strange error in Python 3 (Assuming A is an array):

    log_idx = A <= 16;
                  ^
SyntaxError: Unexpected "=" (parser)

The MATLAB code I am trying to convert is:

idx = A <= 16;

which should convert to almost the same thing in Python 3:

idx = A <= 16

The only real difference between the Python 3 code and the Python 2 code is the PLY-generated parsetab.py file, which has substantial differences in the following variables:

_tabversion
_lr_signature
_lr_action_items
_lr_goto_items

I'm having trouble understanding the purpose of these variables and why they could be different when the only difference was the Python version used to generate the parsetab.py file.

I tried searching for documentation on this, but was unsuccessful. I originally suspected it could be a difference in the way strings are formatted between Python 2 and Python 3, but that didn't turn anything up either. Is there anyone familiar with PLY that could give some insight into how these variables are generated, or why the Python version is creating this difference?

Edit: I'm not sure if this would be useful to anyone because the file is very long and cryptic, but below is an example of part of the first lines of _lr_action_items and _lr_goto_items

Python 2:

_lr_action_items = {'DOTDIV':([6,9,14,20,22,24,32,34,36,42,46,47,52,54,56,57,60,71,72,73,74,75 ...
_lr_goto_items = {'lambda_args':([45,80,238,],[99,161,263,]),'unwind':([1,8,28,77,87,160,168,177 ...

Python 3:

_lr_action_items = {'END_STMT':([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26,27,39,41,48,50 ...
_lr_goto_items = {'top':([0,],[1,]),'stmt':([1,44,46,134,137,207,212,214,215,244,245,250 ...

Solution

  • I'm going to go out on a limb here, because you have provided practically no indication of what code you are actually using. So I'm just going to assume that you copied the lexer.py file from the github repository you linked to in your question.

    There's an important clue in this error message:

    log_idx = A <= 16;
                  ^
    SyntaxError: Unexpected "=" (parser)
    

    Evidently, <= is not being scanned as a single token; otherwise, the parser would not see an = token at that point in the input. This can only mean that the scanner is returning two tokens, < and =, and if that's the case, it is most certainly a syntax error, as you would expect from

    log_idx = A < = 16;
    

    To figure out why the lexer would do this, it's important to understand how the Ply (default) lexer works. It gathers up all the lexer patterns from variables whose names start t_, which must be either functions or variables whose values are strings. It then sorts them as follows:

    1. function docstrings, in order by line number in the source file.
    2. string values, in reverse order by length.

    See Specification of Tokens in the Ply manual.

    That usually does the right thing, but not always. The intention of sorting in reverse order by length is that a prefix pattern will come after a pattern which matches a longer string. So if you had patterns '<' and '<=', '<=' would be tried first, and so in the case where the input had <=, the < pattern would never be tried. That's important, since if '<' is tried first, '<=' will never be recognised.

    However, this simple heuristic does not always work. The fact that a regular expression is shorter does not necessarily mean that its match will be shorter. So if you expect "maximal munch" semantics, you sometimes have to be careful about your patterns. (Or you can supply them as docstrings, because then you have complete control over the order.)

    And whoever created that lexer.py file was not careful about their patterns, because it includes (among other issues):

    t_LE          = r"<="
    t_LT          = r"\<"
    

    Note that since these are raw strings, the backslash is retained in the second string, so both patterns are of length 2:

    >>> len(r"\<")
    2
    >>> len(r"<=")
    2
    

    Since the two patterns have the same length, their relative order in the sort is unspecified. And it is quite possible that the two versions of Python produce different sort orders, either because of differences in the implementation of sort or because of differences in the order which the dictionary of variables is iterated, or some combination of the above.

    < has no special significance in a Python regular expression, so there is no need to backslash-escape it in the definition of t_LT. (Clearly, since it is not backslash-escaped in t_LE.) So the simplest solution would be to make the sort order unambiguous by removing the backslash:

    t_LE          = r"<="
    t_LT          = r"<"
    

    Now, t_LE is longer and will definitely be tried first.

    That's not the only instance of this problem in the lexer file, so you might want to revise it carefully.

    Note: You could also fix the problem by adding an unnecessary backslash to the t_LE pattern; there is an argument for taking the attitude, "When in doubt, escape." However, it is useful to know which characters need to be escaped in a Python regex, and the Python documentation for the re package contains a complete list. Also, consider using long raw strings for patterns which include quotes, since neither " nor ' need to be backslash escaped in a Python regex.