compiler-construction clang llvm emscripten

What is the motivation for the existence of specific clang versions (like the emscripten one)?

I recently started doing some work with the emscripten C/C++ to JavaScript compiler, and when trying to build the compiler from source, I saw that it has its own specific version of clang.

So far, I haven't been able to find a reason why there is a separate version of the compiler. I was under the impression that every backend could take input from every frontend, as long as you follow the llvm specs and use the same llvm version for both. I can imagine that one might do this to provide specific command line options, but I can't see the advantage of rebuilding the entire thing over writing a script that does that job and connects the dots.

So, what exactly are the advantages of doing a specific clang build, over just implementing a backend?

Solution

Implementing a backend is exactly what we're doing for WebAssembly, and Emscripten will eventually be able to use that backend for both WebAssembly and asm.js.

Emscripten and PNaCl both started as uncertain experiments, by authors without much experience with LLVM. It was easier for them to write their own thing in a way that they thought sensible than to meet the constraints of LLVM. As these things usually go, those experiments haven't fully gone away... But over time WebAssembly should make everything right.

LLVM's constraints are entirely sensible: keep code quality high, don't raise the maintenance burden, and avoid creating needless dependencies between layers of the code. That is what allows LLVM to be successful: when it accepts a new backend, LLVM wants to become better as a whole, not to be burdened by that backend, or to have a backend which exists in isolation or becomes unmaintained. Other constraints are more historic: when PNaCl started in ~2010 (video), LLVM didn't have great support for the x86 instruction bundling which NaCl's x86 sandboxes rely on, or for x32, which NaCl relies on for its x86-64 sandbox, and for both PNaCl and Emscripten it wasn't clear at the time how virtual ISAs could be supported. I'm oversimplifying: many other factors went into those decisions, and I'm sure that if they were made today things would go differently.

PNaCl still carries substantial changes (LLVM and clang forks), even though many of them have made it into upstream LLVM and PNaCl developers have participated in code reviews which made LLVM friendlier to PNaCl's use cases. These changes fall into three categories: backend changes necessary for NaCl sandboxing (which Subzero intends to replace), bitcode "simplification" passes which stand in for a backend done "the LLVM way" (and which Emscripten uses), and other miscellaneous changes, many of which could be upstreamed.

Emscripten's approach changed substantially when it moved to fastcomp.

Note that there's much more to this than LLVM and clang: these compilers also rely on C++ standard libraries (both have libc++ / libc++abi, mostly unchanged; PNaCl used to support libstdc++), C standard libraries (both have musl; PNaCl also has newlib, bionic, and some form of glibc), a compiler runtime (compiler-rt), a linker, and general user libraries such as SDL (part of Emscripten, and for PNaCl in webports).

In both cases the compilers are rebased to trunk LLVM semi-frequently, though Emscripten's changes are much easier to rebase than PNaCl's. There's a huge cost to maintaining a fork!