Tags: assembly, x86, micro-optimization

Impact on performance when having multiple returns


Take a look at the scenario below.

 ;some code
 test reg1,reg2
 je jump1
 ;do something
 add rsp,20
 pop rdx
 ret
jump1:
 ;do something
 cmp reg2,reg3
 jg jump2
 add rsp,20
 pop rdx
 ret
jump2:
 ;do something
 add rsp,20
 pop rdx
 ret

Code like this is not commonly found in disassembled output. Perhaps compilers handle such cases more efficiently.

Can having multiple return statements affect performance? What are the possible performance outcomes of using a single return reached via jmp, compared to the above?


Solution

  • This is called the "tail duplication" optimization. Some compilers do this sometimes; see, e.g., the LLVM blog post about it.

    It's generally a good thing when your function epilogues are small (only 1 pop), because then it doesn't cost much, especially on modern x86 with its large caches and good code density (ret and pop are single-byte). However, if only one path through the function is expected to be "hot", it may be better to have the other paths jmp to the hot one's epilogue to save a small amount of uop-cache space.

    It saves one taken jmp on that path out of the function. The performance impact of that depends on the surrounding code, as always for a deeply pipelined superscalar out-of-order CPU!

    If multiple paths through a function could be hot depending on how your function is used, they can both/all be fully efficient.
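
    For comparison with the question's code, here is a rough sketch of the single-ret alternative in NASM syntax. The function name classify_shared, its return values, and the push/pop rbx prologue/epilogue are made up for illustration, and it assumes the x86-64 System V calling convention (args in rdi, rsi, rdx). Two of the three paths pay an extra taken jmp to reach the shared epilogue; only the last one falls through.

    ```nasm
    ; Hypothetical example: long classify_shared(long a, long b, long c)
    ; Assumes x86-64 System V: a = rdi, b = rsi, c = rdx, result in rax.
    global classify_shared

    section .text

    classify_shared:
        push rbx                ; dummy save, only so there is an epilogue to share
        test rdi, rsi
        je   .zero              ; taken when (a & b) == 0
        xor  eax, eax           ; path 1: return 0
        jmp  .done              ; extra taken jump to the shared epilogue
    .zero:
        cmp  rsi, rdx
        jg   .greater           ; taken when b > c (signed)
        mov  eax, 1             ; path 2: return 1
        jmp  .done              ; extra taken jump to the shared epilogue
    .greater:
        mov  eax, 2             ; path 3: return 2, falls through
    .done:
        pop  rbx                ; the single shared epilogue
        ret
    ```

    In this particular sketch a short jmp is 2 bytes, the same as pop rbx + ret, so duplicating the epilogue on each path would not even cost code size. With a longer epilogue the trade-off shifts, and whether the saved jmp is measurable depends on the surrounding code.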


    You can also do it for loops that have a branch inside the loop: duplicate the dec/jcc or whatever at the bottom of the loop instead of jumping to a common dec/jcc. (Don't forget to handle the fall-through path in both / all cases!)
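
    A minimal sketch of the loop case, again in NASM syntax and assuming x86-64 System V. The function alt_sum is hypothetical (it adds odd elements and subtracts even ones), and it assumes n >= 1 so the loop body can run before the count is checked. Each arm of the branch ends with its own copy of dec/jnz instead of jumping to a common loop tail, and each copy has its own fall-through exit.

    ```nasm
    ; Hypothetical example: long alt_sum(const long *p, long n)
    ; Assumes x86-64 System V: p = rdi, n = rsi, result in rax.
    ; Assumes n >= 1 (no check for an empty array).
    global alt_sum

    section .text

    alt_sum:
        xor  eax, eax           ; sum = 0
    .loop:
        mov  rcx, [rdi]         ; load the current element
        add  rdi, 8             ; advance to the next element
        test cl, 1              ; low bit set means odd
        jz   .even
        add  rax, rcx           ; odd arm: sum += element
        dec  rsi
        jnz  .loop              ; duplicated loop tail (copy 1)
        ret                     ; fall-through exit for the odd arm
    .even:
        sub  rax, rcx           ; even arm: sum -= element
        dec  rsi
        jnz  .loop              ; duplicated loop tail (copy 2)
        ret                     ; fall-through exit for the even arm
    ```

    With a common tail, the odd arm would need a jmp over the even arm to reach the shared dec/jnz on every iteration that takes it; duplicating the tail removes that jump at the cost of a few extra bytes.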