Tags: python, syntax-error, tokenize

How can I tokenize Python source code that has a syntax error?


I'm trying to tokenize Python source code that contains syntax errors so I can feed it into a statistical model (e.g. a Recurrent Neural Network).

However, the built-in tokenize module yields ERRORTOKEN for Python code with syntax errors, and in some cases raises an exception outright.
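
For illustration, some syntax errors only produce ERRORTOKEN entries and the generator keeps going (this is the behaviour of the pure-Python tokenizer, pre-3.12; a minimal check with a stray character):

import io
import tokenize

# A stray "$" is reported as an ERRORTOKEN instead of raising,
# so the rest of the token stream is still usable.
for tok in tokenize.generate_tokens(io.StringIO("x = 1 $ 2\n").readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

Other errors, like the ones below, abort tokenization with an exception instead.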

This is the function I'm using:

import tokenize
from io import BytesIO
from typing import List

def to_token_list(s: str) -> List:
    tokens = []  # list of tokens extracted from the source code

    g = tokenize.tokenize(BytesIO(s.encode("utf-8")).readline)

    for t in g:
        tokens.append(t)

    return tokens

Here is an example input - it is missing an opening bracket (identifiers are masked as ID):

syntax_error_source_code = "\ndef ID ID ):\n    if ID .ID :\n        ID .ID .ID ()\n"
to_token_list(syntax_error_source_code)

Error:

Exception has occurred: TokenError
('EOF in multi-line statement', (5, 0))

I could handle this error by wrapping the loop in a try-except, but that only works because the TokenError is raised at the end of the input; it doesn't recover from errors raised partway through tokenization, as in the next example.
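
Here is roughly what I mean by wrapping it (a sketch, not my exact code):

import tokenize
from io import BytesIO
from typing import List

def to_token_list_wrapped(s: str) -> List:
    tokens = []
    g = tokenize.tokenize(BytesIO(s.encode("utf-8")).readline)
    try:
        for t in g:
            tokens.append(t)
    except tokenize.TokenError:
        pass  # raised at end of input, so everything before it was already collected
    # An IndentationError raised mid-stream is not caught here, and even if it were,
    # the generator stops at that point, so the remaining tokens are lost.
    return tokens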

Another example, which the try-except approach does not handle:

syntax_error_source_code = '\ndef ID ():\n/    for ID ,ID in ID :\n        pass \n    for ID ,ID in ID :\n        pass \n'
to_token_list(syntax_error_source_code)

Error:

Exception has occurred: IndentationError
unindent does not match any outer indentation level (<tokenize>, line 5)

I have found this discussion on the issue: https://bugs.python.org/issue12675

Is there a way to circumvent this?


Solution

  • I found a solution to this problem from someone taking the same course.

    Below is the relevant part.

    The tokenizer is:

    import io
    import tokenize
    from typing import List

    def tokenizer(
            s: str, id: int, error_dict: dict
        ) -> List[tokenize.TokenInfo]:

        fp = io.StringIO(s)
        # Token types that carry no information for the model.
        filter_types = [tokenize.ENCODING, tokenize.ENDMARKER, tokenize.ERRORTOKEN]
        tokens = []
        token_gen = tokenize.generate_tokens(fp.readline)
        while True:
            try:
                token = next(token_gen)
                if token.string and token.type not in filter_types:
                    tokens.append(token)
            except tokenize.TokenError:
                # Raised at end of input (e.g. unbalanced brackets);
                # the tokens collected so far are kept.
                error_dict["TokenError"].append(id)
                break
            except StopIteration:
                break
            except IndentationError:
                # Record the error and keep pulling from the generator.
                error_dict["IndentationError"].append(id)
                continue
        return tokens
    

    Some clarification:

    • error_dict is a dictionary collecting the errors that might pop up, e.g.: {"TokenError": [], "IndentationError": []}

    • s is the source code, as a string, that you want to tokenize

    • id is the id of the source code in a larger database, if you are tokenizing multiple snippets/files (see the usage sketch below).
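
    A usage sketch (illustrative; assumes the tokenizer function above is defined in the same module):

    error_dict = {"TokenError": [], "IndentationError": []}

    snippet = "\ndef ID ID ):\n    if ID .ID :\n        ID .ID .ID ()\n"
    tokens = tokenizer(snippet, id=0, error_dict=error_dict)

    print([t.string for t in tokens])  # tokens recovered before the error
    print(error_dict)                  # snippet ids recorded under the error they hit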