GNU as, puts works but printf does not

This is the code I am playing with right now:

# file-name: test.s
# 64-bit GNU as source code.
    .global main

    .section .text
main:
    lea message, %rdi
    push %rdi
    call puts

    lea message, %rdi
    push %rdi
    call printf

    push $0
    call _exit

    .section .data
message: .asciz "Hello, World!"

Compilation instructions: gcc test.s -o test

Revision 1:

    .global main
    .section .text
main:
    lea message, %rdi
    call puts

    lea message, %rdi
    call printf

    mov $0, %rdi
    call _exit

    .section .data
message: .asciz "Hello, World!"

Final Revision (Works):

    .global main
    .section .text
main:
    lea message, %rdi
    call puts

    mov $0, %rax
    lea message, %rdi
    call printf

    # flush stdout buffer.
    mov $0, %rdi
    call fflush

    # put newline to offset PS1 prompt when the program ends.  
    # - ironically, doing this makes the flush above redundant and can be removed.
    # - The call to  fflush is retained for display and 
    #      to keep the block self contained.  
    mov $'\n', %rdi
    call putchar

    mov $0, %rdi
    call _exit

    .section .data
message: .asciz "Hello, World!"

I am struggling to understand why the call to puts succeeds but the call to printf results in a segmentation fault.

Can somebody explain this behavior and how printf is intended to be called?

Thanks ahead of time.

Summary:

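printf takes its format string in %rdi and, because it is a variadic function, expects %al (the low byte of %rax) to hold an upper bound on the number of vector registers used for floating-point arguments; here that is 0.
When stdout is a terminal it is line-buffered, so printf output cannot be seen until a newline is written to stdout or fflush is called (fflush(0) flushes every open output stream).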
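
The buffering point in the summary can be observed from C as well. What follows is only a rough sketch, not part of the original code: it assumes a POSIX system, uses a temporary file instead of a terminal so the effect can be measured, and the names file_size and line_buffer_demo are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: size of the file behind fp as the kernel sees it,
 * i.e. only the bytes that have actually left the stdio buffer. */
static long file_size(FILE *fp) {
    struct stat st;
    return fstat(fileno(fp), &st) == 0 ? (long)st.st_size : -1;
}

/* Hypothetical demo: records the file size at three points.
 *   sizes[0]  after printing text with no newline
 *   sizes[1]  after an explicit fflush
 *   sizes[2]  after printing more text that ends in '\n'
 * Returns 0 on success, -1 on failure. */
int line_buffer_demo(long sizes[3]) {
    FILE *fp = tmpfile();
    if (fp == NULL) return -1;

    /* Ask for line buffering, the mode stdout uses on a terminal. */
    if (setvbuf(fp, NULL, _IOLBF, BUFSIZ) != 0) {
        fclose(fp);
        return -1;
    }

    fprintf(fp, "Hello, World!");   /* 13 bytes, no newline */
    sizes[0] = file_size(fp);       /* still in the buffer */

    fflush(fp);                     /* push the buffer to the file */
    sizes[1] = file_size(fp);

    fprintf(fp, "again\n");         /* 6 bytes; the newline triggers a flush */
    sizes[2] = file_size(fp);

    fclose(fp);
    return 0;
}
```

With glibc I would expect the three sizes to come out as 0, 13 and 19: nothing reaches the file until either fflush or a newline forces it out.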

Solution

puts appends a newline implicitly, and stdout is line-buffered by default when it is a terminal. So the text from printf may just be sitting in the buffer. Your call to _exit(2) doesn't flush stdio buffers, because it goes straight to the exit_group(2) system call, unlike the exit(3) library function, which flushes them first. (See my version of your code below).
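
To see the difference between _exit(2) and exit(3) without involving assembly, here is a small C sketch. It is an illustration under assumptions, not the asker's program: it assumes a POSIX system, redirects the child's stdout into a pipe so the parent can count what arrives, and the function name bytes_seen_after is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: a child process does printf("Hello, World!") with
 * its stdout connected to a pipe, then terminates with _exit(0) if
 * use_underscore_exit is non-zero, or exit(0) otherwise.
 * Returns the number of bytes the parent read from the pipe, or -1. */
long bytes_seen_after(int use_underscore_exit) {
    int fds[2];
    if (pipe(fds) != 0) return -1;

    /* Flush first so the child does not inherit pending parent output. */
    fflush(stdout);

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                       /* child */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);      /* stdout now writes to the pipe */
        close(fds[1]);
        printf("Hello, World!");          /* no newline: stays buffered */
        if (use_underscore_exit)
            _exit(0);                     /* skips stdio cleanup */
        exit(0);                          /* flushes stdio buffers first */
    }

    close(fds[1]);                        /* parent: read until EOF */
    char buf[64];
    long total = 0;
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        total += n;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return total;
}
```

Calling bytes_seen_after(1) should report 0 bytes, because the text dies in the child's buffer, while bytes_seen_after(0) should report all 13 bytes of the message.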

Your call to printf(3) is also not quite right, because you didn't zero %al before calling a var-args function with no FP arguments. (Good catch @RossRidge, I missed that). xor %eax,%eax is the best way to do that. %al will be non-zero (from puts()'s return value), which is presumably why printf segfaults. I tested on my system, and printf doesn't seem to mind when the stack is misaligned (which it is, since you pushed twice before calling it, unlike puts).

(Update: newer builds of glibc will segfault in printf with misaligned RSP even with AL=0, since gcc makes more use of SSE to load or store 16 bytes at a time, and of course takes advantage of the ABI-guaranteed alignment. See an example from scanf and how to avoid it)

Also, you don't need any push instructions in that code. The first arg goes in %rdi. The first 6 integer args go in registers, the 7th and later go on the stack. You're also neglecting to pop the stack after the functions return, which only works because your function never tries to return after messing up the stack.

The ABI does require aligning the stack by 16B, and a push is one way to do that, which can actually be more efficient than sub $8, %rsp on recent Intel CPUs with a stack engine, and it takes fewer bytes. (See the x86-64 SysV ABI, and other links in the x86 tag wiki).

Improved code:

.text
.global main
main:
    # Or  mov $message, %edi  if you don't need the code to be position-independent:
    # the default code model has all labels in the low 2GiB, so you can use
    # shorter 32-bit instructions.
    lea     message(%rip), %rdi
    push    %rbx              # align the stack for another call
    mov     %rdi, %rbx        # save for later
    call   puts

    xor     %eax,%eax         # %al = 0 = number of FP args for var-args functions
    mov     %rbx, %rdi        # or mov %ebx, %edi  in a non-PIE executable, since the pointer is known to be pointing to static storage which will be in the low 2GiB
    call   printf

    # optionally putchar a '\n', or include it in the string you pass to printf

    #xor    %edi,%edi    # exit with 0 status
    #call  exit          # exit(3) does an fflush and other cleanup

    pop     %rbx         # restore caller's rbx, and restore the stack

    xor     %eax,%eax    # return 0 from main is equivalent to exit(0)
    ret

    .section .rodata     # constants should go in .rodata
message: .asciz "Hello, World!"

lea message(%rip), %rdi is cheap, and doing it twice is fewer instructions than the two mov instructions to make use of %rbx. But since we needed to adjust the stack by 8B to strictly follow the ABI's 16B-aligned guarantee, we might as well do it by saving a call-preserved register. mov reg,reg is very cheap and small, so taking advantage of the call-preserved reg is natural.

Modern distros now default to making PIE executables so pointers are 64-bit even for static storage. You need RIP-relative LEA, and need 64-bit operand-size to copy them. See How to load address of function or label into register for that vs. mov $message, %edi in a non-PIE. There's never a reason to use lea message, %rdi with a 32-bit absolute addressing mode, only ever RIP-relative LEA or mov-immediate.