
Opening a simple .exe file in Notepad++ vs. in Sublime Text 3 yields very different results


I compiled the following C code with GCC for Windows 10 (MinGW-w64):

#include <stdio.h>
int main(){
    printf("Hello World!");
    return 0;
}

with the command

gcc.exe -o test test.c

It works: when I run the resulting file I do get Hello World! in the console. However, I am surprised, because when I open test.exe in Notepad++ it is 220 lines long, with some readable text in it such as

Address %p has no image-section VirtualQuery failed for %d bytes at address %p

and also

Unknown pseudo relocation protocol version %d. Unknown pseudo relocation bit size %d.

However, when I open the same file in Sublime Text 3, I get over 3300 lines of seemingly random numbers and letters, such as:

4d5a 9000 0300 0000 0400 0000 ffff 0000
b800 0000 0000 0000 4000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 8000 0000
0e1f ba0e 00b4 09cd 21b8 014c cd21 5468
6973 2070 726f 6772 616d 2063 616e 6e6f
7420 6265 2072 756e 2069 6e20 444f 5320
6d6f 6465 2e0d 0d0a 2400 0000 0000 0000
5045 0000 6486 0f00 5aca 455d 0068 0000
9304 0000 f000 2700 0b02 021e 001e 0000
0038 0000 000a 0000 e014 0000 0010 0000
0000 4000 0000 0000 0010 0000 0002 0000
0400 0000 0000 0000 0500 0200 0000 0000
0020 0100 0004 0000 0e3e 0100 0300 0000
0000 2000 0000 0000 0010 0000 0000 0000
0000 1000 0000 0000 0010 0000 0000 0000
0000 0000 1000 0000 0000 0000 0000 0000
0080 0000 6c07 0000 0000 0000 0000 0000
0050 0000 7002 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

I also generated the assembly output, and that file looks the same in Notepad++ and in Sublime Text:

    .file   "test.c"
    .text
    .def    __main; .scl    2;  .type   32; .endef
    .section .rdata,"dr"
.LC0:
    .ascii "Hello World!\0"
    .section    .text.startup,"x"
    .p2align 4,,15
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
    subq    $40, %rsp    #,
    .seh_stackalloc 40
    .seh_endprologue
 # test.c:2: int main(){
    call    __main   #
 # test.c:3:    printf("Hello World!");
    leaq    .LC0(%rip), %rcx     #,
    call    printf   #
 # test.c:5: }
    xorl    %eax, %eax   #
    addq    $40, %rsp    #,
    ret 
    .seh_endproc
    .ident  "GCC: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0"
    .def    printf; .scl    2;  .type   32; .endef

First question:

Why is the output different in Sublime Text and Notepad++?

Second question:

Where are the 0s and 1s? I thought machine code was only 0s and 1s.

Third question:

How come it's 3300 lines for just a simple hello world? Doesn't that sound grossly inefficient?

Thanks for any insight!


Solution

  • An .exe file is a binary file. Most of it is non-printable, non-human readable bytes. So your question actually boils down to, why are these two text editors doing two different things with a non-text file which they're not even designed to manipulate in the first place?

    Buried within a binary file may be some human-readable strings. First of all, some fraction of the bytes in a binary file will be, by chance, in the printable set. Also, computer programs that contain text strings like "Can't open file" will typically end up containing those strings embedded, literally, in their binaries.

    Typically, a text editor displays a binary file as garbage: it shows the printable characters it knows about, indiscriminately intermixed with "funny" representations of the nonprintable characters. (On Windows platforms, at least, it's not unusual for the nonprinting characters to be displayed using a mapping to the old MS-DOS character set, which did have special graphics characters in many of the nonprintable positions.) It looks like that's what Notepad++ is doing.

    It looks like Sublime is noticing that the file is binary, and converting every byte in it to hexadecimal. That means you can't immediately see the printing characters, but you can uniformly see (as hexadecimal) all the characters, the printable and the nonprintable, side by side.

    To make this clearer, let's look at a slightly different case. Consider this program:

    #include <stdio.h>
    
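    int main(void)
    {
        /* A mix of control bytes (octal and hex escapes) and printable text. */
        char binary[] = "\1\2\3Hello\4\5\6World\x1E\x1F\x20\x21";
        /* sizeof includes the terminating '\0', so it is written out too. */
        fwrite(binary, 1, sizeof(binary), stdout);
        return 0;
    }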
    

    This program prints a mixture of text and binary characters to its standard output. If you compile and run this program and redirect its output to a file, you'll end up with a file with a mixture of text and binary characters in it, just like (in this respect) your .exe file.

    If I print the output of this program in my normal environment, I get:

    HelloWorld !
    

    We can see the printable strings Hello and World as we might have expected, and a ! character as we might not have expected. In my normal environment, the unprintable characters print as nothing at all.

    If I printed the output of this program in an MS-DOS environment (where, as I mentioned, a lot of those theoretically "unprintable" characters did have graphic representations), I might see something like

    ☺☻♥Hello♦♣♠World▲▼ !
    

    If I run this program through a program that converts every byte to its hexadecimal representation, I get

    01020348656C6C6F040506576F726C641E1F202100
    

    Let's look at this carefully. It starts with hex 010203, which clearly corresponds to the leading "\1\2\3" of the string. Next comes 48656C6C6F, which if you look them up are the hexadecimal ASCII codes for the string "Hello". Next comes 040506, which corresponds to the "\4\5\6" part. Next comes 576F726C64, which is, you guessed it, "World". Next comes 1E1F2021, which is of course the final "\x1E\x1F\x20\x21". Finally, at the very end, there's 00, which is the '\0' character which the compiler automatically appended to the end of the string in the binary array.

    You've probably figured this out, but hex 20 and 21 are the ASCII codes for the space and ! characters, which is why those two showed up in the printed output.

    If I run the output through the Unix/Linux command cat -v, which makes the nonprintable characters visible using a "control character" representation ^X, I get:

    ^A^B^CHello^D^E^FWorld^^^_ !^@
    

    Finally, here's one more representation of the output, run through a "hex dump" program which shows both the hexadecimal and text representations, side by side, but with nonprintable characters replaced by dots:

    01 02 03 48 65 6c 6c 6f  04 05 06 57 6f 72 6c 64   ...Hello...World
    1e 1f 20 21 00                                     .. !.