I like dabbling with optimization, primarily for space, by thinking about algorithmic flow. After experimenting with a few different scenarios, this is the best one I could dream up. Assume, too, that this is a must: when this procedure is called, DF will sometimes be set and sometimes not.
OBJECTIVE: Return DF to caller unaltered.
00 55 push bp
01 89E5 mov bp, sp
03 9C pushfw
04 FD std
... Function/Subroutine body
05 58 pop ax ; Retrieve original flags
06 F6C404 test ah, DF ; Was DF already set
09 7506 jnz Exit ; If so, nothing to do
; Reset DF without altering state of any other flags
0B 9C pushfw
0C 8066FFFB and byte [bp-1], 0FBH ; Strip DF from MSB of EFLAGS
10 9D popfw
Exit:
11 C9 leave
12 C3 ret
Excluding STD, which will actually be part of the main body, the prologue/epilogue that addresses this objective comes in at 18 bytes.
Has anyone implemented a method that comes in tighter than this?
You have one good option:
require DF clear on function entry / exit (except for private helper functions, which can use custom calling conventions). Functions that want to use std can then simply use cld before any call or ret they make.
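As a minimal sketch of that convention (the routine name copy_backward and its register interface are made up for illustration, not taken from the question), a function that needs a descending string operation could look like this:

```nasm
bits 16

; Hypothetical routine: copy CX bytes from DS:SI to ES:DI, walking downwards.
; SI and DI must point at the LAST byte of the source and destination.
; Assumed convention: DF = 0 on entry and DF = 0 on exit.
copy_backward:
        std                 ; string instructions now decrement SI/DI
        rep movsb           ; copies CX bytes (does nothing if CX = 0)
        cld                 ; put DF back to 0 before returning
        ret
```

The whole cost of the convention here is the one-byte cld before ret; no pushf/popf and no stack frame are needed.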
Using other flags as part of the return value is trivial: cld and std don't affect them.
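For instance, a backwards search could hand its result to the caller in ZF. This is only a sketch: find_last and its interface are hypothetical, and it assumes CX is nonzero (with CX = 0, repne scasb executes no comparison and ZF keeps whatever value it had before).

```nasm
bits 16

; Hypothetical routine: search CX bytes downwards for the value in AL.
; ES:DI must point at the LAST byte of the buffer. Assumes CX != 0.
; Assumed convention: DF = 0 on entry and on exit.
; Returns ZF = 1 if a match was found (DI is then one byte BELOW the match),
;         ZF = 0 if no byte matched.
find_last:
        std                 ; scan towards lower addresses
        repne scasb         ; stops on the first match or when CX reaches 0
        cld                 ; restores DF = 0; ZF from scasb is left alone
        ret
```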
In the rare case that you need to make a function call inside a loop that traverses in descending order, maybe don't use lods or other string instructions at all. They're not magic, and the occasional dec si / mov al, [si] or whatever is not a disaster for code-size. Or it means a cld and std inside the loop, only 1 byte each.
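A sketch of the first alternative, with hypothetical names (walk_down, process_byte) and the assumption that the callee preserves SI and CX and follows the DF-clear convention:

```nasm
bits 16

; Hypothetical loop: visit CX bytes in descending order, calling
; process_byte on each one. DF stays 0 throughout.
; On entry: SI = address one past the last byte, CX = byte count (nonzero).
walk_down:
.next:
        dec si              ; 1 byte
        mov al, [si]        ; 2 bytes; together they stand in for a 1-byte lodsb
        call process_byte   ; callee sees DF = 0, as the convention requires
        loop .next
        ret

; Stub standing in for the real callee (assumed to preserve SI and CX).
process_byte:
        ret
```

The other alternative, std / lodsb / cld ahead of the call, is also three bytes, but lodsb reads before it decrements, so SI would have to start at the last byte rather than one past it.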
Most of the time you want DF clear so you're looping upwards, in which case you can have function calls inside loops without any issue. (Not all of the time, but this design is the best for the common case, and the uncommon cases can be handled without too much pain).
One fairly good option:
cld or std inside the same function. So this isn't really a very good option.
And one mediocre option:
cld or std because it can't assume anything, and a save/restore of FLAGS. (Unless you optimize based on known callers and the state they put DF into.) The only time this has an advantage is in a loop containing a function call that also wants DF set.
When you want to return a status in the condition codes of FLAGS as well as save/restore DF, simply put the instruction that sets condition codes in FLAGS after the popf that restores the caller's DF.
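A sketch of that ordering, using a made-up routine (sum_down) that assumes a nonzero CX and clobbers AX:

```nasm
bits 16

; Hypothetical routine: sum CX bytes, walking downwards from DS:SI, into DX,
; and report in ZF whether the sum is zero. Assumes CX != 0. Clobbers AX.
; The caller's DF is preserved whatever it was.
sum_down:
        pushf               ; save the caller's FLAGS, including DF
        std                 ; descending lodsb inside this function
        xor dx, dx          ; running sum
        xor ah, ah          ; so AX holds the zero-extended byte after lodsb
.next:
        lodsb
        add dx, ax
        loop .next
        popf                ; caller's DF (and every other flag) is back
        test dx, dx         ; set the status flags last; test does not touch DF
        ret
```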
In functions that don't have interesting status in condition codes, simply use popf to restore all the caller's FLAGS. You don't need to merge the caller's DF into the current function's FLAGS in functions that don't return anything interesting in FLAGS.
In the rare case where you can't easily move the last string instruction before the last flag-setting instruction, it might be smaller to emulate the string instruction. Instead of inc or dec, leave flags untouched with lea si, [si +- 1]. (Both SI and DI are valid in 16-bit addressing modes, so lea is usable on them.)
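As an illustration, a descending lodsb could be replaced by the following pair, neither of which reads or writes FLAGS (the label name is hypothetical):

```nasm
bits 16

; Hypothetical stand-in for "lodsb" with DF = 1, leaving FLAGS untouched.
load_down_keep_flags:
        mov al, [si]        ; load the byte at DS:SI; mov never modifies flags
        lea si, [si-1]      ; step SI down by one; lea never modifies flags
        ret
```

That is five bytes of code instead of one for lodsb, so it only pays off when it saves a pushf/popf pair or a more awkward reordering.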
None of lods / stos / movs are magical, not even their rep versions. If you don't care about performance (only code-size) you could emulate even rep movs without touching flags, using the slow loop instruction and a spare register (save/restore one with push/pop if needed).
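One way this could look for an ascending byte copy is sketched below. The routine name is made up, and AX is assumed to be the spare register that needs saving; every instruction used (push, pop, mov, lea, jcxz, loop) leaves FLAGS alone.

```nasm
bits 16

; Hypothetical flag-preserving stand-in for "rep movsb" with DF = 0.
; Copies CX bytes from DS:SI to ES:DI. On return CX = 0 and SI/DI have
; advanced past the copied bytes, as they would after rep movsb.
copy_keep_flags:
        push ax             ; save the spare register
        jcxz .done          ; rep does nothing when CX = 0, so skip the loop
.copy:
        mov al, [si]        ; read from DS:SI
        mov [es:di], al     ; write to ES:DI
        lea si, [si+1]      ; advance without touching flags
        lea di, [di+1]
        loop .copy          ; decrements CX and branches, flags unaffected
.done:
        pop ax
        ret
```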
0B 9C pushfw
0C 8066FFFB and byte [bp-1], 0FBH ; Strip DF from MSB of EFLAGS
10 9D popfw
Your code-sample doesn't restore the caller's DF, it unconditionally clears it. It's equivalent to cld.
To restore the caller's DF, you'd need to extract the DF bit from the first pushf, merge it into the FLAGS value from the pushf at the end of the function, and then popf. This is obviously possible, but significantly less efficient (and larger in code-size) than what you show.
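A rough sketch of such a merge, to show how much code it takes. The routine name and the choice of AX and DX as scratch registers are assumptions made for illustration:

```nasm
bits 16

DF_MASK equ 0x0400          ; DF is bit 10 of FLAGS

; Hypothetical routine that keeps its own status flags but hands back the
; caller's DF. Clobbers AX and DX.
merge_df_example:
        pushf               ; save the caller's FLAGS at entry
        std
        ; ... function body, ending with the flag-setting instruction ...
        pushf               ; snapshot the current FLAGS on top of the saved copy
        pop ax              ; AX = current FLAGS
        pop dx              ; DX = caller's FLAGS from entry
        and ax, 0xFBFF      ; drop the current DF (all bits except DF_MASK)
        and dx, DF_MASK     ; keep only the caller's DF
        or ax, dx           ; current status flags + caller's DF
        push ax
        popf                ; load the merged value into FLAGS
        ret
```

The and/or instructions overwrite the live flags, which is harmless here only because the final popf replaces them with the merged value.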
Also note that popf is slow: on Haswell it's 9 uops, with a throughput of one per 18 cycles. If you only care about code-size, not performance, designs that require pushf/popf aren't necessarily bad, but it seems to me that requiring DF clear on entry/exit will win for code size most of the time, as well as performance.
This is what 32 and 64-bit calling conventions choose for handling DF, and I don't see why it wouldn't work well in 16-bit code, too.