What's the difference between the token _123jh
and 123jh
that makes most lexers not include a number-starting identifier? I suppose one reason might be that a number-only token might be confusing, and so it's easier to just eliminate the leading-number entirely as opposed to allowing something like:
^(\d+[A-z_][A-z_0-9]*|[_A-z][A-z0-9]*)$
.
Or are there other reasons as well (maybe a lexer cannot guarantee a single-char lookahead using this way)?
Because most languages support numbers, and in particular, floating point numbers in exponential notation, complex numbers, and hexadecimal. If a variable can begin with a digit, why can't it be all digits? And if so, how do you distinguish it from an actual number? Worse, 9e9
is a perfectly legal number, and isn't even all digits, so you can't say "it has to have at least one non-digit character", because legal numbers can have non-digit characters too.
Then we add on the languages that allow arbitrary insertion of underscores within (but not at the ends of) numeric literals to aid with readability (e.g. Python allows 1_000_000
to represent the same thing as 1000000
but make the three digit groupings easier), or have suffixes to adjust the type (e.g. 5U
/123L
in C/C++, 7u8
/999i16
/etc. in Rust) and you end up in a position where allowing variable names to start with a digit means you need to impose all sorts of other, far more arbitrary restrictions on variable names to avoid conflicting with numeric literals.
Basically, it's a lot easier to say "it has to start with a non-digit character" than to arbitrarily say 9e9
and 0xf
are not a legal variable name, but 9ee9
and 0xg
are.