I'm trying to tokenize Python source code that contains syntax errors so I can feed it into a statistical model (e.g. a Recurrent Neural Network).
However, the built-in `tokenize` module yields `ERRORTOKEN` (and raises exceptions) for Python code with syntax errors. This is the function I'm using:
```python
import tokenize
from io import BytesIO
from typing import List


def to_token_list(s: str) -> List:
    tokens = []  # list of tokens extracted from source code
    g = tokenize.tokenize(BytesIO(s.encode("utf-8")).readline)
    for t in g:
        tokens.append(t)
    return tokens
```
Here is an example input; it is missing an opening bracket (identifiers are masked as `ID`):

```python
syntax_error_source_code = "\ndef ID ID ):\n if ID .ID :\n ID .ID .ID ()\n"
to_token_list(syntax_error_source_code)
```
Error:

```
Exception has occurred: TokenError
('EOF in multi-line statement', (5, 0))
```
I could suppress this error by wrapping the function in a try/except, but that doesn't help with errors raised partway through tokenization, as in the next example.
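For reference, the wrapped version I mean looks roughly like this (a sketch; the name `to_token_list_safe` is just illustrative):

```python
import tokenize
from io import BytesIO
from typing import List


def to_token_list_safe(s: str) -> List[tokenize.TokenInfo]:
    """Collect tokens until the first TokenError, then stop."""
    tokens = []
    g = tokenize.tokenize(BytesIO(s.encode("utf-8")).readline)
    try:
        for t in g:
            tokens.append(t)
    except tokenize.TokenError:
        pass  # keep whatever was tokenized before the error
    return tokens
```

This handles the bracket example above, but an `IndentationError` raised inside the loop is not caught.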
Another example, which fails even with the try/except:

```python
syntax_error_source_code = '\ndef ID ():\n/ for ID ,ID in ID :\n pass \n for ID ,ID in ID :\n pass \n'
to_token_list(syntax_error_source_code)
```
Error:

```
Exception has occurred: IndentationError
unindent does not match any outer indentation level (<tokenize>, line 5)
```
I have found this discussion on the issue: https://bugs.python.org/issue12675
Is there a way to circumvent this?
I have found a solution to this problem from someone going through the same course. Below is the relevant part.
The tokenizer is:
```python
import io
import tokenize
from typing import List


def tokenizer(
    s: str, id: int, error_dict: dict
) -> List[tokenize.TokenInfo]:
    fp = io.StringIO(s)
    filter_types = [tokenize.ENCODING, tokenize.ENDMARKER, tokenize.ERRORTOKEN]
    tokens = []
    token_gen = tokenize.generate_tokens(fp.readline)
    while True:
        try:
            token = next(token_gen)
            if token.string and token.type not in filter_types:
                tokens.append(token)
        except tokenize.TokenError:
            # unbalanced brackets: the rest of the input cannot be tokenized
            error_dict["TokenError"].append(id)
            break
        except StopIteration:
            break
        except IndentationError:
            # record the error and keep pulling from the generator
            error_dict["IndentationError"].append(id)
            continue
    return tokens
```
Some clarification:

- `error_dict` is a dictionary of errors that might pop up, e.g. `{"TokenError": [], "IndentationError": []}`
- `s` is the source code, as a string, that you want to tokenize
- `id` is the id of the source code in a larger database, if you are tokenizing multiple snippets/files.
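To see the behaviour end-to-end, here is a usage sketch (the function is repeated so the snippet runs standalone; the first input is the bracket example from the question, while the second is a synthetic snippet with a mismatched dedent, since the exact whitespace of the question's second example may not survive formatting):

```python
import io
import tokenize
from typing import List


def tokenizer(
    s: str, id: int, error_dict: dict
) -> List[tokenize.TokenInfo]:
    fp = io.StringIO(s)
    filter_types = [tokenize.ENCODING, tokenize.ENDMARKER, tokenize.ERRORTOKEN]
    tokens = []
    token_gen = tokenize.generate_tokens(fp.readline)
    while True:
        try:
            token = next(token_gen)
            if token.string and token.type not in filter_types:
                tokens.append(token)
        except tokenize.TokenError:
            error_dict["TokenError"].append(id)
            break
        except StopIteration:
            break
        except IndentationError:
            error_dict["IndentationError"].append(id)
            continue
    return tokens


error_dict = {"TokenError": [], "IndentationError": []}

# Missing opening bracket -> TokenError is recorded, tokens are still returned.
tokens_0 = tokenizer("\ndef ID ID ):\n if ID .ID :\n ID .ID .ID ()\n", 0, error_dict)

# Mismatched dedent (8 spaces, then 4) -> IndentationError is recorded
# instead of being raised.
tokens_1 = tokenizer("if ID :\n        pass\n    pass\n", 1, error_dict)

# error_dict is now {"TokenError": [0], "IndentationError": [1]}
```

Nothing is raised to the caller; each snippet's id ends up in the bucket for the error it triggered, so you can filter or inspect the bad snippets later.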