Python IDLE miscoloring with "print" statement


Proof of point: http://adams-site.x10.mx/v/python.png

You'll notice in this image the two print statements are different colours.

It doesn't matter a great deal and I'm not really bothered, but it would be nice to know why this happens, or whether it is just a bug.

(I have seen this link, but I really would like to know why.)


Solution

  • This is the code responsible for syntax highlighting from ColorDelegator.py:

    def any(name, alternates):
        "Return a named group pattern matching list of alternates."
        return "(?P<%s>" % name + "|".join(alternates) + ")"
    
    def make_pat():
        kw = r"\b" + any("KEYWORD", keyword.kwlist) + r"\b"
        builtinlist = [str(name) for name in dir(__builtin__)
                                            if not name.startswith('_')]
        # self.file = file("file") :
        # 1st 'file' colorized normal, 2nd as builtin, 3rd as string
        builtin = r"([^.'\"\\#]\b|^)" + any("BUILTIN", builtinlist) + r"\b"
        comment = any("COMMENT", [r"#[^\n]*"])
        sqstring = r"(\b[rRuU])?'[^'\\\n]*(\\.[^'\\\n]*)*'?"
        dqstring = r'(\b[rRuU])?"[^"\\\n]*(\\.[^"\\\n]*)*"?'
        sq3string = r"(\b[rRuU])?'''[^'\\]*((\\.|'(?!''))[^'\\]*)*(''')?"
        dq3string = r'(\b[rRuU])?"""[^"\\]*((\\.|"(?!""))[^"\\]*)*(""")?'
        string = any("STRING", [sq3string, dq3string, sqstring, dqstring])
        return kw + "|" + builtin + "|" + comment + "|" + string +\
               "|" + any("SYNC", [r"\n"])
    

    It builds up a large regular expression which it uses to match items to colour. In particular, the regex defined as kw will match a keyword (as defined by the keyword module) wherever it appears as a whole word, while the regex defined as builtin will match a builtin (as discovered by scanning __builtin__) only when it is at the very start of the text or is preceded by a character that isn't a period, quote, double-quote, backslash or hash symbol.
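
    To make the named-group mechanism concrete, here is a minimal sketch of how one such group is built and read back. The helper is renamed any_group (an assumption, purely to avoid shadowing Python's built-in any), and the two-word list is illustrative rather than the real keyword.kwlist.

```python
import re

def any_group(name, alternates):
    """Return a named group pattern matching a list of alternates."""
    return "(?P<%s>" % name + "|".join(alternates) + ")"

# Illustrative two-word list, not the real keyword.kwlist.
pattern = r"\b" + any_group("KEYWORD", ["if", "else"]) + r"\b"

match = re.search(pattern, "x = 1 if y else 2")
# groupdict() maps each named group to the text it captured (or None),
# which is how a colorizer can tell what kind of token it found.
label = [name for name, text in match.groupdict().items() if text][0]
```

    The name of the group that captured text is what decides the colour, so two groups that can both capture the same word are competing for it.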

    Now, there is a combination of factors at work to give the strange behaviour you see. First of all, in Python 2.7 print is both a keyword and a builtin. (I'm not sure why, but I imagine it might be to keep closer to Python 3.0, where print is a builtin and not a keyword.) So a regex is constructed that can match print as either a keyword or a builtin. But why does it sometimes match as one and sometimes as the other?
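
    If you want to see which names fall into both lists on your own interpreter, a quick check along these lines should do it. Note the assumption: this sketch targets Python 3, where the module is called builtins rather than __builtin__ and print is no longer a keyword, so the overlap it reports will differ from the Python 2.7 situation described above.

```python
import builtins
import keyword

# Public builtins, filtered the same way the IDLE code above filters them.
public_builtins = {name for name in dir(builtins) if not name.startswith("_")}

# Names that are both keywords and builtins on the running interpreter.
overlap = sorted(set(keyword.kwlist) & public_builtins)
print(overlap)
```

    Any name in that overlap can be claimed by either the KEYWORD group or the BUILTIN group.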

    The difference is due to the construction of the regex. A search returns the match that starts earliest in the text, and the order of the alternatives only decides the winner when two of them can start at the same position. At the start of a line, the kw regex matches from the first character, and because it is the first alternative it wins before the rest can be considered. However, after the start of the line, the builtin regex actually matches a character earlier, because the first character it looks for is "any character that isn't a period, quote, double-quote, backslash or hash". Even though that character isn't included in the labelled group, it's still part of the match. So when print is preceded by a space or tab, the builtin regex matches first.
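
    The effect can be reproduced with a cut-down version of the pattern. This is a sketch rather than IDLE's actual code: KEYWORDS and BUILTINS are hypothetical hard-coded lists standing in for Python 2.7's keyword.kwlist and dir(__builtin__), any_group is the helper renamed to avoid shadowing the built-in any, and the comment and string alternatives are left out.

```python
import re

def any_group(name, alternates):
    """Return a named group pattern matching a list of alternates."""
    return "(?P<%s>" % name + "|".join(alternates) + ")"

# Hypothetical stand-ins for Python 2.7's keyword.kwlist and
# dir(__builtin__); "print" is deliberately in both.
KEYWORDS = ["print", "if", "else"]
BUILTINS = ["print", "len"]

kw = r"\b" + any_group("KEYWORD", KEYWORDS) + r"\b"
builtin = r"([^.'\"\\#]\b|^)" + any_group("BUILTIN", BUILTINS) + r"\b"
prog = re.compile(kw + "|" + builtin, re.S)

def first_label(line):
    """Return (group name, start offset) of the first match in line."""
    m = prog.search(line)
    if m is None:
        return None
    name = "KEYWORD" if m.group("KEYWORD") else "BUILTIN"
    return name, m.start()

at_line_start = first_label("print 'a'")      # print at offset 0
after_indent = first_label("    print 'b'")   # print at offset 4
```

    In the indented case the match starts at offset 3, the space just before print, which is one character earlier than the kw alternative could start. That is why the BUILTIN group gets there first.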

    One way to fix this would be to use a negative lookbehind assertion, but such a complicated regular expression already makes me a bit nervous, and I'm never sure which regex features can result in catastrophic performance degradation. A simpler fix is to filter out any builtins that are also keywords before constructing the regex. That's exactly what has been done in Python 3.2.2, as described in the bug report linked to from the question you reference.
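
    Both fixes can be sketched on a cut-down pattern. As assumptions: KEYWORDS and BUILTINS are hypothetical hard-coded lists with print in both, any_group and label_of are illustrative helpers, and the lookbehind variant is only one possible reading of the first suggestion, not code taken from IDLE.

```python
import re

def any_group(name, alternates):
    """Return a named group pattern matching a list of alternates."""
    return "(?P<%s>" % name + "|".join(alternates) + ")"

# Hypothetical stand-ins; "print" is deliberately in both lists.
KEYWORDS = ["print", "if", "else"]
BUILTINS = ["print", "len"]

kw = r"\b" + any_group("KEYWORD", KEYWORDS) + r"\b"

def label_of(pattern, line):
    """Return the name of the group that captured the first match."""
    m = re.search(pattern, line, re.S)
    if m is None:
        return None
    return "KEYWORD" if m.group("KEYWORD") else "BUILTIN"

# Fix 1: drop builtins that are also keywords before building the regex.
filtered = [name for name in BUILTINS if name not in KEYWORDS]
fix_filter = (kw + "|" + r"([^.'\"\\#]\b|^)"
              + any_group("BUILTIN", filtered) + r"\b")

# Fix 2: a negative lookbehind checks the preceding character without
# consuming it, so the builtin alternative no longer starts one
# character early and kw wins the tie at the same position.
fix_lookbehind = (kw + "|" + r"(?<![.'\"\\#])\b"
                  + any_group("BUILTIN", BUILTINS) + r"\b")

indented = "    print 'b'"
labels = (label_of(fix_filter, indented), label_of(fix_lookbehind, indented))
```

    With either variant the indented print is labelled KEYWORD, the same as at the start of a line, while a plain builtin such as len is still labelled BUILTIN.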