Could a compiler produce platform-independent binaries that make no system calls or library calls?

I don't know a lot about low level things. But, as far as I understand, compilers produce binaries with system and library calls inside.

If you don't make any system calls or call any libraries, are binary files just raw machine code that can run unchanged on the same machine under another OS? What else ties a binary to its OS? Is it possible to write a cross-platform "Hello world" ASM program without any system call? Or would that require access to kernel space, making it impossible?

Solution

No.

Executables contain metadata, not just raw machine code, and different OSes use different formats. (The exception is a DOS .com executable; that format has no metadata and just gets loaded into memory at a fixed offset relative to whatever segment the OS chooses.) The metadata lets the OS give you a useful error message (instead of running the file and getting a segfault) when you try to run an x86-64 FreeBSD executable on x86-64 OpenBSD or Linux, even though they all use the same container format (ELF).
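To make the metadata point concrete, here is a minimal Python sketch that reads the few ELF header fields a loader could use to reject a foreign binary. The function name `describe_elf_header` and the small lookup tables are my own illustration, not any OS's actual loader code. Note also that many real binaries leave the OS/ABI byte at 0 and identify their target OS some other way (for example through note sections), so treat this as a simplified picture.

```python
import struct

ELF_MAGIC = b"\x7fELF"

# Partial tables, for illustration only.
OSABI_NAMES = {0: "System V / unspecified", 3: "Linux (GNU)", 9: "FreeBSD", 12: "OpenBSD"}
MACHINE_NAMES = {3: "x86", 62: "x86-64", 183: "AArch64"}


def describe_elf_header(header):
    """Decode the identification bytes and e_machine from an ELF header."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    ei_class = header[4]   # 1 = 32-bit, 2 = 64-bit
    ei_data = header[5]    # 1 = little-endian, 2 = big-endian
    ei_osabi = header[7]   # OS/ABI identification byte
    byte_order = "<" if ei_data == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (e_machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return {
        "bits": 64 if ei_class == 2 else 32,
        "endianness": "little" if ei_data == 1 else "big",
        "osabi": OSABI_NAMES.get(ei_osabi, "other (%d)" % ei_osabi),
        "machine": MACHINE_NAMES.get(e_machine, "other (%d)" % e_machine),
    }
```

On a real system you could feed it the first 20 bytes of a file, e.g. `describe_elf_header(open(path, "rb").read(20))`, where `path` is whatever executable you want to inspect.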

The metadata also lets the OS invoke an emulator transparently, for example macOS on AArch64 hardware running x86-64 binaries through Rosetta, or Linux invoking QEMU via binfmt_misc, instead of loading machine code for the wrong architecture and having the CPU fault almost immediately with an illegal instruction, a page fault, or something similar.
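As a rough illustration of how a binfmt_misc-style rule can recognise a foreign binary, the sketch below compares the start of a file against a magic byte string under a mask, so that "don't care" bits are ignored. The helper `matches_magic` and the particular magic/mask values are assumptions I made up for the example; they are not copied from any real QEMU registration.

```python
def matches_magic(header, offset, magic, mask=None):
    """Return True if header matches magic at offset, ignoring masked-out bits."""
    window = header[offset:offset + len(magic)]
    if len(window) < len(magic):
        return False
    if mask is None:
        mask = b"\xff" * len(magic)
    # A bit takes part in the comparison only where the mask bit is 1.
    return all(((h ^ m) & k) == 0 for h, m, k in zip(window, magic, mask))


# Hypothetical rule: 64-bit little-endian ELF whose e_machine is x86-64 (62 = 0x3e).
X86_64_ELF_MAGIC = b"\x7fELF\x02\x01\x01\x00" + bytes(8) + b"\x02\x00\x3e\x00"
# Ignore the OS/ABI byte and padding, and the low bit of e_type.
X86_64_ELF_MASK = b"\xff" * 7 + b"\x00" * 9 + b"\xfe\xff\xff\xff"
```

A handler registered with a rule like this would be handed any file whose header matches, which is how the machine code never gets executed natively on the wrong CPU.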

Hello World needs to make a write system call or MessageBoxA equivalent WinAPI call, or something similar. System-call ABIs are OS-specific.

For example, on x86-64 Linux, mov eax, 1 / syscall makes a write system call. On x86-64 macOS, it's mov eax, 0x2000004 / syscall. And those are both Unix-like OSes; others like Windows don't even have a stable system-call ABI. There, the only way to do anything outside your own process that is portable across Windows versions is to call DLL functions, and that depends on Windows-specific metadata in your executable.
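The two numbers above are not arbitrary. As far as I know, the macOS value is the BSD syscall number for write (4) combined with a "Unix" syscall class of 2 in the upper bits. The small Python sketch below just reproduces that arithmetic; the table `WRITE_SYSCALL_EAX` and the helper name are mine, chosen for illustration.

```python
# Constants as I understand them from the XNU headers; treat as an assumption.
SYSCALL_CLASS_SHIFT = 24
SYSCALL_CLASS_UNIX = 2


def macos_unix_syscall(number):
    """Combine a BSD syscall number with the Unix class bits used on x86-64 macOS."""
    return (SYSCALL_CLASS_UNIX << SYSCALL_CLASS_SHIFT) | number


# Value loaded into EAX before the syscall instruction to request write().
WRITE_SYSCALL_EAX = {
    ("linux", "x86-64"): 1,
    ("macos", "x86-64"): macos_unix_syscall(4),
}

for (os_name, isa), value in sorted(WRITE_SYSCALL_EAX.items()):
    print("%s/%s: mov eax, %#x ; syscall" % (os_name, isa, value))
```

The same machine instruction (syscall) is used on both, but the kernel interprets the register contents differently, so identical bytes cannot mean "write" on both systems.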

Assuming you had a portable way to get a blob of machine-code for the right ISA running, you can portably do some number crunching in an infinite loop or something, but communicating the results outside of your process requires a system-call. (Assuming a modern mainstream OS worthy of the name, i.e. which does memory protection and multi-tasking, so unlike MS-DOS doesn't let user-space processes access any hardware directly.)

Even exiting portably is a problem; normally that's done with an exit system call, which is OS-specific. What you can do portably is fault: if your process executes an illegal instruction, accesses a bad address, or tries to run a privileged instruction, the OS will kill it, since you won't have made any earlier system calls to set up a handler.

So no, not Hello World; all you can do without system-calls is modify your own memory; mainstream OSes don't start processes with any special areas of memory memory-mapped to anything visible from outside themselves.

You could communicate (exfiltrate data) via a side-channel such as CPU temperature, or performance counters or performance effects on other processes (e.g. memory bandwidth or how fast you evict data from shared L3 cache, either looping over a large amount of stack space or not). Different instruction mixes will heat up a CPU more vs. less; you don't have the option of having your process sleep (that would require a system call), but a loop running x86 pause instructions will use much less power than a block of vmulpd ymm0, ymm1, ymm2 instructions. (Those are 256-bit floating-point multiplies; FMAs are probably even more power-intensive.)

This of course can't make "Hello World" appear in a console unless you're running another process that checks CPU temperature frequently and collects the output a few bits at a time. That program will of course have to make system calls.

But if you don't care about "Hello World" ending up in a console, and just want to communicate data to some other process that also makes no system calls, the receiver could run timed tests (timed with x86 rdtsc or similar), perhaps a microbenchmark whose speed depends on L3 cache hits. The data producer would then either run a loop that writes a big chunk of stack space or one that doesn't touch memory, maybe spending 1 billion TSC ticks per bit of output to give the reader time to see what it was sending. (You could come up with protocols for whatever data you actually want to send. This idea works for any pair of CPU cores that share an L3 cache, which covers most desktop/laptop CPUs except some Ryzens with more than one "core cluster" (CCX).)
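To show what such a protocol might look like, here is a Python simulation of the bit-per-time-slot idea. It does not touch caches or rdtsc at all; it only models the bookkeeping. The slot length comes from the 1 billion ticks mentioned above, while the function names, the latency numbers, and the threshold are hypothetical values I picked for the sketch.

```python
TICKS_PER_BIT = 1_000_000_000  # slot length suggested in the text


def bytes_to_bits(data):
    """Return the bits of data, most significant bit of each byte first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def sender_schedule(data, start_tick=0):
    """Build (slot_start, slot_end, thrash_cache) tuples: thrash the cache for a 1 bit."""
    slots = []
    for index, bit in enumerate(bytes_to_bits(data)):
        slot_start = start_tick + index * TICKS_PER_BIT
        slots.append((slot_start, slot_start + TICKS_PER_BIT, bit == 1))
    return slots


def decode_latencies(latencies, threshold):
    """Turn one averaged probe latency per slot back into bytes."""
    bits = [1 if latency > threshold else 0 for latency in latencies]
    out = bytearray()
    for i in range(0, len(bits) - len(bits) % 8, 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


# Simulated receiver: made-up latencies of 200 cycles when the sender is
# evicting the shared cache and 40 cycles when it is idle.
slots = sender_schedule(b"Hi")
observed = [200 if thrash else 40 for (_, _, thrash) in slots]
print(decode_latencies(observed, threshold=120))
```

A real implementation would also need some way for both sides to agree on when slot 0 starts, and enough redundancy to survive noise from other processes; none of that is modelled here.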