Search code examples
gccutf-8g++byte-order-mark

Is it possible to get GCC to compile UTF-8 with BOM source files?


I develop C++ cross platform using Microsoft Visual Studio on Windows and GCC on Ubuntu Linux.

In Visual Studio, I can use Unicode symbols like "π" and "²" in my code. Visual Studio always saves the source files as UTF-8 with BOM (Byte Order Mark).

For example:

// A = π.r²
double π = 3.14;

GCC happily compiles these files only if I remove the BOM first. If I do not remove the BOM, I get errors like these:

wwga_hydutils.cpp:28:9: error: stray ‘\317’ in program

wwga_hydutils.cpp:28:9: error: stray ‘\200’ in program

Which brings me to the question:

Is there a way to get GCC to compile UTF-8 files without first removing the BOM?


I'm using:

and:


As the first commenter pointed out, my problem was not the BOM, but having non-ASCII characters outside of string constants. GCC does not like non-ASCII characters in symbol names, but it turns out GCC is fully compatible with UTF-8 with BOM.


Solution

  • According to the GCC Wiki, this isn't supported yet. You can use -fextended-identifiers and pre-process your code to convert the identifiers to UCN. From the linked page:

    perl -pe 'BEGIN { binmode STDIN, ":utf8"; } s/(.)/ord($1) < 128 ? $1 : sprintf("\\U%08x", ord($1))/ge;' 
    

    See also g++ unicode variable name and Unicode Identifiers and Source Code in C++11?