
Why do simple programs take up so much storage space?


I created a simple hello world program in C like so:

#include <stdio.h>

int main() {
    printf("Hello World!\n");
    return 0;
}

Afterwards, I compiled it on a Mac using gcc and dumped it using xxd. With 16 bytes per line (8 words), the compiled program was a total of 3073 lines or 49 424 bytes. Out of all these bytes, only 1 904 of them composed the program while the remaining 47 520 bytes were all zeros. Considering that only approximately 3.9% of the bytes are not zeros, this looks like a clear waste of space. Is there any way to optimize the size of the executable here? (By the way, I already tried the -Os compiler option and it made no difference.)

Edit: I got these numbers by counting lines of hexdump, but within the lines containing actual instructions there were also zeros. I didn't count these bytes as they may be crucial to the execution of the program. (Like the null terminator for the string Hello World!) I only counted full blocks of zeros.


Solution

  • gcc on macOS generates object and executable files in the Mach-O file format. The file is divided into multiple segments, each of which has an alignment requirement to make loading more efficient (which is why you get all the zero padding). I took your code and built it on my Mac with gcc, which gives me an executable of 8432 bytes. Yes, xxd shows me a bunch of zeros too. Here's the objdump output of the section headers:

    $ objdump -section-headers hello
    
    hello:  file format Mach-O 64-bit x86-64
    
    Sections:
    Idx Name          Size      Address          Type
      0 __text        0000002a 0000000100000f50 TEXT 
      1 __stubs       00000006 0000000100000f7a TEXT 
      2 __stub_helper 0000001a 0000000100000f80 TEXT 
      3 __cstring     0000000f 0000000100000f9a DATA 
      4 __unwind_info 00000048 0000000100000fac DATA 
      5 __nl_symbol_ptr 00000010 0000000100001000 DATA 
      6 __la_symbol_ptr 00000008 0000000100001010 DATA 
    

    __text contains the machine code of your program, __cstring contains the literal "Hello World!\n", and there's a bunch of metadata associated with each section.
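
    To put a number on that, here is a small sketch that adds up the Size column from the objdump output above and compares it with the 8432-byte file. The function name sum_sizes is just something I made up for illustration; the sizes and the file size are the ones quoted in this answer.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical helper: add up an array of section sizes (in bytes). */
    unsigned sum_sizes(const unsigned *sizes, size_t count) {
        unsigned total = 0;
        for (size_t i = 0; i < count; i++)
            total += sizes[i];
        return total;
    }

    int main(void) {
        /* Size column of the objdump output above, in table order. */
        const unsigned sizes[] = {0x2a, 0x06, 0x1a, 0x0f, 0x48, 0x10, 0x08};
        const unsigned file_size = 8432; /* executable size reported above */

        unsigned total = sum_sizes(sizes, sizeof sizes / sizeof sizes[0]);
        assert(total <= file_size);

        /* Prints: section bytes: 185 of 8432 (2.2%) */
        printf("section bytes: %u of %u (%.1f%%)\n",
               total, file_size, 100.0 * total / file_size);
        return 0;
    }
    ```

    So the listed sections account for only 185 bytes. The rest of the file is headers, load commands, linker metadata, and padding.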

    This kind of structure is obviously overkill for a simple program like yours, but simple programs like yours are not the norm. Object and executable file formats have to be able to support dynamic loading, symbol relocation, and other things that require complex structures. There's a minimum level of complexity (and thus size) for any compiled program.

    So executable files for "small" programs are larger than you'd expect from the source code alone, because there's a lot more than just your source code in there.
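
    The alignment padding mentioned earlier comes down to simple rounding. Below is a minimal sketch of that arithmetic, assuming each segment is rounded up to a 4096-byte page (the usual page size on x86-64; this is my assumption, not something the objdump output states). The helper align_up is hypothetical, and the 24 bytes are the two pointer sections (__nl_symbol_ptr and __la_symbol_ptr) from the table above.

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Hypothetical helper: round size up to the next multiple of alignment.
       alignment must be a power of two. */
    unsigned long align_up(unsigned long size, unsigned long alignment) {
        return (size + alignment - 1) & ~(alignment - 1);
    }

    int main(void) {
        const unsigned long page = 4096;              /* assumed segment alignment */
        const unsigned long data_bytes = 0x10 + 0x08; /* __nl_symbol_ptr + __la_symbol_ptr */

        unsigned long occupied = align_up(data_bytes, page);
        assert(occupied % page == 0);

        /* Prints: 24 bytes of content occupy 4096 bytes (4072 bytes of padding) */
        printf("%lu bytes of content occupy %lu bytes (%lu bytes of padding)\n",
               data_bytes, occupied, occupied - data_bytes);
        return 0;
    }
    ```

    Under that assumption, a couple of dozen bytes of real data still cost a full page in the file, which is where long runs of zeros in the xxd output come from.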