Looking at CPython's tokenizer.c, the tokenizer returns specific error messages.
As an example, take the part where the tokenizer parses a decimal number. Parsing the number 5_6 should be fine, but parsing the number 5__6 should make the tokenizer raise a SyntaxError with the message "invalid decimal literal":
static int
tok_decimal_tail(struct tok_state *tok)
{
    int c;

    while (1) {
        do {
            c = tok_nextc(tok);
        } while (isdigit(c));
        if (c != '_') {
            break;
        }
        c = tok_nextc(tok);
        if (!isdigit(c)) {
            tok_backup(tok, c);
            syntaxerror(tok, "invalid decimal literal");
            return 0;
        }
    }
    return c;
}
Using Python, I tried to get at the tokenizer's SyntaxError message:
In [12]: try:
    ...:     eval('5__6')
    ...: except SyntaxError as e:
    ...:     print(e.args, e.filename, e.lineno, e.msg, e.text)
('invalid token', ('<string>', 1, 2, '5__6')) <string> 1 invalid token 5__6
Is there any way to extract the SyntaxError message from the tokenizer?
You are looking at source code that is only present in Python 3.8a1 and newer. The message was introduced in July 2018 by the pull request
bpo-33305: Improve SyntaxError for invalid numerical literals. (GH-6517)
and is discussed in the attached Python issue #33305.
When I run your code with Python 3.8b2 (the current beta) I see the message you expected:
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=8, micro=0, releaselevel='beta', serial=2)
>>> try:
...     eval('5__6')
... except SyntaxError as e:
...     print(e.args, e.filename, e.lineno, e.msg, e.text)
...
('invalid decimal literal',) <string> 1 invalid decimal literal None
You ran this on Python 3.7 or older, so you won't see the newer messages yet.
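To check which message a given interpreter produces, a small version-agnostic helper can be useful. This is only a minimal sketch: the name `syntax_error_message` is hypothetical and not part of CPython. It compiles the source in `eval` mode and returns the `msg` attribute of the resulting SyntaxError, which is the same attribute printed in the sessions above.

```python
import sys


def syntax_error_message(source):
    """Return the SyntaxError message for `source`, or None if it compiles.

    Hypothetical helper for illustration; not part of CPython.
    """
    try:
        # compile() runs the tokenizer and parser without evaluating anything.
        compile(source, "<string>", "eval")
    except SyntaxError as exc:
        return exc.msg
    return None


if __name__ == "__main__":
    # Print the interpreter version next to the message it reports.
    print(sys.version_info[:2], syntax_error_message("5__6"))
```

Going by the sessions above, this should report "invalid token" on 3.7 and "invalid decimal literal" on 3.8, while a valid literal such as 5_6 yields None.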