
Do compilers put data inside the .text section of PE or ELF files? If so, why?


There was a question asked about this a while ago:

Why do Compilers put data inside .text(code) section of the PE and ELF files and how does the CPU distinguish between data and code?

But the top answer to that question says there is no data in the .text section and that compilers don't do that!

However, while debugging some binaries in OllyDbg I have seen weird bytes in .text that are probably data, and I still read papers that claim data could be inside the .text section.

This is actually the reason static disassembly is an undecidable problem (at least, academic papers claim it is): they say data could be inside the .text section and we can never know.

So I want to put this question to rest once and for all. Please provide a source if you answer:

  1. Do compilers put data inside the .text section? If so, which compilers, and which versions, do you know that do this?

  2. If they do this, why? I read the answer given to the question I linked, but I couldn't understand it since I'm not really an expert on hardware. Can you provide a simpler explanation, something a software developer can understand?

Here's another source saying we cannot distinguish data from code in executables:

https://www.usenix.org/legacy/publications/library/proceedings/usenix03/tech/full_papers/prasad/prasad_html/node5.html

distinguishing code from data in a binary file is a fundamentally undecidable problem


Solution

  • For x86, gcc/clang/ICC/MSVC don't mix data with code because it's pointless, like I said in my answer on the linked question. (Not counting immediate data, which would decode as part of an instruction, obviously.) The end of the .text section and the start of the .rodata section might be adjacent inside the TEXT segment, but that's not what you mean.

    For non-x86 ELF binaries (e.g. ARM), compilers do mix code and read-only data (literal pools placed near the code that uses them). This lets PC-relative loads reach the constants with offsets of 12 bits or fewer, which is all that fits into a fixed-width load instruction.
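
To make the offset limit concrete, here is a minimal Python sketch of the reach check for two ARM PC-relative load encodings. The function names are hypothetical, and the ranges are my reading of the A32 `LDR (literal)` form (12-bit byte offset, either direction, PC reads as the instruction address plus 8) and the 16-bit Thumb form (8-bit word offset, forward only, PC reads as the instruction address plus 4, rounded down to a multiple of 4). Treat it as an illustration of why constants have to sit close to the code, not as an assembler.

```python
def a32_ldr_literal_reachable(instr_addr: int, literal_addr: int) -> bool:
    """A32 LDR (literal): 12-bit byte offset, added to or subtracted from PC."""
    pc = instr_addr + 8                 # in A32 state, PC reads 8 bytes ahead
    return abs(literal_addr - pc) <= 4095


def thumb16_ldr_literal_reachable(instr_addr: int, literal_addr: int) -> bool:
    """16-bit Thumb LDR (literal): imm8 * 4, forward only, word-aligned target."""
    base = (instr_addr + 4) & ~3        # PC reads 4 ahead, then aligned down to 4
    offset = literal_addr - base
    return 0 <= offset <= 1020 and offset % 4 == 0


if __name__ == "__main__":
    # A constant just under 4 KiB away is reachable; one in a distant
    # read-only section (addresses made up for the example) is not.
    print(a32_ldr_literal_reachable(0x1000, 0x1F00))
    print(a32_ldr_literal_reachable(0x1000, 0x20000))
```

With a reach this short, a constant stored in a separate read-only section would usually be out of range, which is why the pool ends up between functions in the code itself.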

    Obfuscated x86 binaries certainly might mix in some data, or just make disassembly hard so it looks like there might be some. Static disassembly is normally easy on compiler-generated code that hasn't been intentionally obfuscated. Anything that confuses disassembly can make it look like possible data. And yes, in the general case, telling code from data is undecidable.
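
As an illustration of how a disassembler can be confused, here is a small Python sketch of one commonly described trick: a short `jmp` over a single junk byte that happens to be the opcode of a 5-byte `call`. The byte sequence and the tiny length table are made up for this example (they cover only the opcodes used here), so this is a toy, not a real disassembler and not something taken from a particular obfuscator.

```python
# Instruction lengths for the few opcodes in this toy example only.
LENGTHS = {0xEB: 2,   # jmp rel8
           0xE8: 5,   # call rel32
           0x55: 1,   # push rbp
           0x5D: 1,   # pop rbp
           0x90: 1,   # nop
           0xC3: 1}   # ret


def linear_starts(code: bytes) -> list:
    """Decode straight through, one instruction after another."""
    starts, i = [], 0
    while i < len(code):
        starts.append(i)
        i += LENGTHS.get(code[i], 1)
    return starts


def follow_jump_starts(code: bytes, entry: int = 0) -> list:
    """Decode along the path execution would take (jmp rel8 and ret only)."""
    starts, i = [], entry
    while 0 <= i < len(code) and i not in starts:
        starts.append(i)
        op = code[i]
        if op == 0xC3:                  # ret: this path ends
            break
        if op == 0xEB:                  # jmp rel8: continue at the target
            rel = int.from_bytes(code[i + 1:i + 2], "little", signed=True)
            i = i + 2 + rel
        else:
            i += LENGTHS.get(op, 1)
    return starts


# jmp +1 ; junk 0xE8 ; push rbp ; pop rbp ; nop ; nop ; ret
blob = bytes([0xEB, 0x01, 0xE8, 0x55, 0x5D, 0x90, 0x90, 0xC3])

if __name__ == "__main__":
    print("linear sweep:", linear_starts(blob))
    print("follow jumps:", follow_jump_starts(blob))
```

The linear sweep treats the junk byte as a `call` and swallows the real instructions at offsets 3 to 6 as its operand, while following the jump finds them. The byte at offset 2 is never executed, so in effect it is data sitting in the code.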


    Nowhere in my linked answer did I say that binaries with mixed code + constants don't exist. I only said that normal optimizing compilers don't do it, and that it has no performance advantages. It only has anti-reverse-engineering advantages, at a small cost in performance assuming the data is read-only. (Or a very large cost if the data is read/write.)

    Binary obfuscation is a real thing that people use on commercial software. I'm not at all surprised that you've found binaries in the wild that don't disassemble cleanly. But this is done after compiling, making a new obfuscated binary from compiler output. (Or maybe with compiler plugins? I'm really not sure.) Either way, it's not the compiler proper that's doing it; it's a later step in the build toolchain. People who sell binary-obfuscation software are selling a binary->binary converter, not a compiler, I think.

    I've never had any trouble disassembling gcc/clang output on any Linux distro (e.g. stuff in /usr/bin or /usr/lib). Without debug symbols you get huge blocks of instructions, but disassembly doesn't get out of sync with how execution would reach it. Padding between functions is long NOPs that decode sanely after the ret or jmp at the bottom of a function. Or with MSVC, the padding is single-byte int3 instructions that again don't desync decoding of the start of the next function the way 00 00 bytes (add [rax], al) would.
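
The padding point can be checked with a toy linear sweep in Python. The decoder below is a sketch that knows only the handful of x86-64 opcodes used in the example, and the two byte blobs are made up: a `ret`, three padding bytes, then a function starting with `push rbp`. Three zero bytes are an assumption chosen to show the desync (an even number of them would pair up as `add [rax], al` and end on the function boundary).

```python
ONE_BYTE = {0x90: "nop", 0xC3: "ret", 0xCC: "int3",
            0x55: "push rbp", 0x5D: "pop rbp"}
MODRM_OPS = {0x00: "add r/m8, r8", 0x89: "mov r/m, r"}


def modrm_len(code: bytes, i: int) -> int:
    """Length of the ModRM byte at code[i] plus any SIB byte and displacement."""
    mod, rm = code[i] >> 6, code[i] & 7
    n = 1
    if mod != 3 and rm == 4:                      # SIB byte follows
        n += 1
        if mod == 0 and i + 1 < len(code) and (code[i + 1] & 7) == 5:
            n += 4                                # SIB without base: disp32
    if mod == 1:
        n += 1                                    # disp8
    elif mod == 2 or (mod == 0 and rm == 5):
        n += 4                                    # disp32 (or RIP-relative)
    return n


def linear_sweep(code: bytes) -> list:
    """Return (offset, length, name) tuples, decoding straight from offset 0."""
    out, i = [], 0
    while i < len(code):
        start = i
        if 0x40 <= code[i] <= 0x4F and i + 1 < len(code):
            i += 1                                # skip a REX prefix
        op = code[i]
        if op in ONE_BYTE:
            name, i = ONE_BYTE[op], i + 1
        elif op in MODRM_OPS and i + 1 < len(code):
            name, i = MODRM_OPS[op], i + 1 + modrm_len(code, i + 1)
        else:
            name, i = "(unknown)", i + 1
        out.append((start, i - start, name))
    return out


# push rbp ; mov rbp, rsp ; pop rbp ; ret
func = bytes([0x55, 0x48, 0x89, 0xE5, 0x5D, 0xC3])
int3_padded = bytes([0xC3, 0xCC, 0xCC, 0xCC]) + func
zero_padded = bytes([0xC3, 0x00, 0x00, 0x00]) + func
FUNC_START = 4

if __name__ == "__main__":
    for label, blob in (("int3", int3_padded), ("zero", zero_padded)):
        starts = {off for off, _, _ in linear_sweep(blob)}
        print(label, "padding: function start decoded on a boundary?",
              FUNC_START in starts)
```

With `int3` padding every padding byte is its own instruction, so the sweep lands exactly on `push rbp`. With the zero bytes, the third `00` takes the `push rbp` byte as its ModRM byte and the REX prefix as a displacement, so the real function start is never decoded as an instruction.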

    Notice the difference between your claim (that obfuscated binaries exist) and the much, much stronger claim made in the paper linked from the other question (that optimizing compilers do this aggressively for performance reasons, including on x86).

    If you want to implement binary rewriting that must work for every binary, then yes, you have a huge problem. But if you only have to care about non-obfuscated compiler output, it's significantly easier.