
C identifier names: What goes with which compiler?


I was experimenting with extern and extern "C" for a bit and accidentally had a typo in one of the identifiers - a $ had snuck in. When I compiled the code, I got an undefined-symbol error. Once I saw what had caused it, I became curious whether an identifier containing $ would compile at all. And guess what - Clang actually did compile that.

According to documentation I had read previously, the rules for identifiers were basically:

  • No double underscore at the beginning - because those are reserved.
  • No single underscore followed by an upper-case letter at the beginning - reserved too.
  • Must start with a letter, a non-digit.
  • Must not exceed 31 characters.
  • May contain a-z, A-Z or 0-9 and _.

But this compiled just fine, and no warning was shown either:

void __this$is$a$mess() {}
int main() { __this$is$a$mess(); }

When looking at it:

Ingwie@Ingwies-Macbook-Pro.local /tmp $ clang y.c
Ingwie@Ingwies-Macbook-Pro.local /tmp $ nm a.out
0000000100000f90 T ___this$is$a$mess
0000000100000000 T __mh_execute_header
0000000100000fa0 T _main
                 U dyld_stub_binder

I can see the symbol name very clearly.

So why is it that Clang will let me do this, although by ANSI standards, it should not? Even the GCC 6 I have installed did not warn or error about this.

Which compilers allow which kinds of identifiers, and why?


Solution

  • The rules in the 2018 C standard for identifiers include:

    • Per 6.4.2.1 1, an identifier is a sequence of identifier-nondigit and digit characters, starting with an identifier-nondigit.
    • An identifier-nondigit is _, a to z, A to Z, a universal-character-name, or “other implementation-defined characters”.
    • A digit is 0 to 9.
    • A universal-character-name is \u followed by four hexadecimal digits or \U followed by eight hexadecimal digits, which specify Unicode characters.

    So, if an implementation allows $, that is a valid character for that implementation. You may use it, but it may not be portable to other implementations. The C standard requires implementations to accept the specific characters listed, but it allows them to accept more. Generally, the C standard should be viewed as an open field rather than a walled garden: The behavior is defined within the field, but you are not stopped at the barrier; you may go beyond it, at your own risk.
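
To make the distinction concrete, here is a minimal sketch of a checker that accepts only the characters every implementation must allow: _, the 52 unaccented letters, and, after the first character, the digits 0 to 9. The function names are hypothetical and not part of any library. As a simplifying assumption, it ignores universal-character-names and does not reject keywords or reserved names.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true for '_' and the 52 unaccented Latin letters.
   A lookup string is used instead of range comparisons because the C
   standard does not guarantee that letters are contiguous in the
   execution character set. */
static bool is_basic_nondigit(char c)
{
    static const char nondigits[] =
        "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    /* strchr also matches the terminating '\0', so exclude it explicitly. */
    return c != '\0' && strchr(nondigits, c) != NULL;
}

/* The digits '0' to '9' are guaranteed to be contiguous. */
static bool is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Hypothetical checker: true if s uses only the characters listed
   explicitly in the grammar. Simplification: universal-character-names,
   keywords, and reserved names are not considered. */
bool is_portable_identifier(const char *s)
{
    if (!is_basic_nondigit(s[0]))
        return false;               /* empty, or starts with a digit or other character */
    for (size_t i = 1; s[i] != '\0'; ++i)
        if (!is_basic_nondigit(s[i]) && !is_digit(s[i]))
            return false;           /* e.g. '$' is implementation-defined */
    return true;
}
```

With this sketch, is_portable_identifier("main") is true, while is_portable_identifier("__this$is$a$mess") is false: the $ falls into the "other implementation-defined characters" category, so a compiler may accept it but is not required to.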

    The rules you were taught were rules for what is portable, not rules for what the C standard requires implementations to restrict you to.

    The C standard defines strictly conforming code, which is, roughly speaking, code that should work in any C implementation, and conforming code, which is code that works in at least one C implementation. Conforming code is still C code. So the rules you were taught were for strictly conforming code.

    Generally, you should prefer to write strictly conforming code and only use additional features when the benefit (speed, ease of development on a particular platform, whatever) is worth the cost (loss of portability).
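
As a small illustration of that trade-off, the sketch below spells the same hypothetical function two ways. The first spelling compiles only where the implementation chooses to accept $ in identifiers, which, as the question shows, Clang and GCC 6 on macOS did. The second uses only characters that every implementation must accept. Compiling with -pedantic is one way to ask GCC or Clang to report extensions, though whether this particular one is diagnosed may depend on the compiler version and target.

```c
/* Non-portable spelling: '$' is an "other implementation-defined
   character", so this definition is accepted only by implementations
   that choose to allow it. The function name is hypothetical. */
int total$cents(int dollars, int cents)
{
    return dollars * 100 + cents;
}

/* Portable spelling: only '_' and basic letters, which every
   implementation must accept. */
int total_cents(int dollars, int cents)
{
    return dollars * 100 + cents;
}
```

The two functions behave identically on a compiler that accepts both. The only difference is that the second one will also compile on an implementation that rejects $.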