Is there a better method to preserve DF in a function or subroutine?

I like dabbling with optimization, primarily for code size, by thinking about algorithmic flow. After experimenting with a few different scenarios, this is the best one I could come up with. Assume that preserving DF is a must: when this procedure is called, DF will sometimes be set and sometimes not.

OBJECTIVE: Return DF to caller unaltered.

00  55       push   bp
01  89E5     mov    bp, sp
03  9C       pushfw

04  FD       std

    ... Function/Subroutine body

05  58       pop    ax          ; Retrieve original flags
06  F6C404   test   ah, DF      ; Was DF already set
09  7506     jnz    Exit        ; If so, nothing to do

; Reset DF without altering state of any other flags

0B  9C       pushfw
0C  8066FFFB and byte [bp-1], 0FBH  ; Strip DF from MSB of EFLAGS
10  9D       popfw

       Exit:
11  C9       leave
12  C3       ret

Excluding STD, which is really part of the main body, the prologue/epilogue needed to meet this objective comes to 18 bytes.

Has anyone implemented a method that comes in tighter than this?

Solution

You have one good option:

Require DF to be clear on function entry and exit (except for private helper functions, which can use custom calling conventions). Functions that want to use std can then simply use cld before any call or ret they make.

Using other flags as part of the return value is trivial: cld and std don't affect them.

In the rare case that you need to make a function call inside a loop that traverses in descending order, maybe don't use lods or other string instructions at all. They're not magic, and the occasional dec si / mov al, [si] or whatever is not a disaster for code-size. Alternatively, put a cld before the call and a std after it inside the loop; they are only 1 byte each.

Most of the time you want DF clear so you're looping upwards, in which case you can have function calls inside loops without any issue. (Not all of the time, but this design is the best for the common case, and the uncommon cases can be handled without too much pain).
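As a minimal sketch of this convention (NASM syntax; copy_down is a hypothetical routine, and the register usage is an assumption for illustration), a function that needs to run downwards only pays for std and one cld:

    copy_down:              ; hypothetical; by convention DF = 0 on entry
        std                 ; 1 byte: string instructions now count down
        rep movsb           ; copy CX bytes from DS:SI to ES:DI, descending
        cld                 ; 1 byte: re-establish DF = 0 before returning
        ret

No pushf/popf and no stack frame are needed, and callers that loop upwards never have to touch DF at all.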

One fairly good option:

Have all flags (including DF) call-clobbered (and used as part of the return value if you want). Every string op needs a cld or std inside the same function. So this isn't really a very good option.

And one mediocre option:

Your current calling convention where DF is call-preserved and has unknown value on function entry. Every function that uses a string instruction needs a cld or std because it can't assume anything, and a save/restore of FLAGS. (Unless you optimize based on known callers and the state they put DF into).

The only time this has an advantage is in a loop containing a function call that also wants DF set.

When you want to return a status in the condition codes of FLAGS as well as save/restore DF, simply put the instruction that sets condition-codes in FLAGS after the popf that restores the caller's DF.
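A sketch of that ordering, assuming NASM syntax and a hypothetical routine that reports in ZF whether the byte at DS:SI is zero:

    last_byte_is_zero:      ; hypothetical; DS:SI points at the byte to read
        pushf               ; save the caller's FLAGS, DF included
        std                 ; the body walks downwards
        lodsb               ; AL = [DS:SI], SI = SI - 1
        popf                ; caller's DF (and all other flags) restored
        test    al, al      ; set ZF for the return status; test leaves DF alone
        ret

Because test runs after popf, the caller sees its own DF together with the freshly computed condition codes.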

In functions that don't have interesting status in condition codes, simply use popf to restore all of the caller's FLAGS. You don't need to merge the caller's DF into the current function's FLAGS in functions that don't return anything interesting in FLAGS.

In the rare case where you can't easily move the last string instruction before the last flag-setting instruction, it might be smaller to emulate the string instruction. Instead of inc or dec, leave flags untouched with lea si, [si +- 1]. (Both SI and DI are valid in 16-bit addressing modes, so lea is usable on them.)
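For instance, a descending lodsb could be emulated like this (a sketch in NASM syntax; it ignores DF entirely and always steps down):

        mov     al, [si]    ; load the byte, as lodsb would
        lea     si, [si-1]  ; step SI down without touching FLAGS

That is 5 bytes instead of 1 for lodsb, so it only pays off when it avoids a FLAGS save/restore elsewhere.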

None of lods / stos / movs are magical, not even their rep versions. If you don't care about performance (only code-size), you could emulate even rep movs without touching flags, using the slow loop instruction and a spare register (save/restore one with push/pop if needed).
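A rough sketch of such a loop in NASM syntax, assuming AL is the spare register and using a hypothetical label copy_noflags; this version copies upwards, and changing both displacements to -1 would copy downwards:

    copy_noflags:           ; hypothetical; copies CX bytes from DS:SI to ES:DI
        jcxz    .done       ; nothing to do for CX = 0 (jcxz doesn't touch FLAGS)
    .copy:
        mov     al, [si]    ; load one byte from the source
        mov     [es:di], al ; store it to the destination
        lea     si, [si+1]  ; advance the pointers without touching FLAGS
        lea     di, [di+1]
        loop    .copy       ; decrement CX and branch; FLAGS are not modified
    .done:
        ret

None of mov, lea, jcxz or loop writes FLAGS, so both DF and the condition codes are the same on exit as on entry.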

 0B  9C       pushfw
 0C  8066FFFB and byte [bp-1], 0FBH  ; Strip DF from MSB of EFLAGS
 10  9D       popfw

Your code sample doesn't restore the caller's DF; it unconditionally clears it. It's equivalent to cld.

To restore the caller's DF, you'd need to extract the DF bit from the first pushf, merge it into the FLAGS value from the pushf at the end of the function, and then popf. This is obviously possible, but significantly less efficient (and larger in code-size) than what you show.
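One way that merge might look, as a sketch in NASM syntax. It assumes the same push bp / mov bp, sp / pushf prologue as in the question, so the caller's FLAGS sit at [bp-2], and that AX is free to clobber; func_merge_df is a hypothetical name:

    func_merge_df:              ; hypothetical
        push    bp
        mov     bp, sp
        pushf                   ; [bp-2] = caller's FLAGS
        std
        ; ... function body (must leave SP balanced) ...
        pushf
        pop     ax              ; AX = FLAGS as left by the body
        and     ah, 0FBh        ; drop the body's DF (bit 10 = bit 2 of AH)
        and     byte [bp-1], 04h ; keep only the caller's DF in the saved high byte
        or      ah, [bp-1]      ; merge the caller's DF into the body's FLAGS
        push    ax
        popf                    ; load the merged FLAGS
        leave                   ; leave and ret don't modify FLAGS
        ret

The and/or instructions do modify FLAGS themselves, but that doesn't matter because the body's FLAGS were already captured in AX and the final popf overwrites everything.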

Also note that popf is slow: on Haswell it's 9 uops, with a throughput of one per 18 cycles. If you only care about code-size, not performance, designs that require pushf/popf aren't necessarily bad, but it seems to me that requiring DF clear on entry/exit will win for code size most of the time, as well as performance.

This is what 32-bit and 64-bit calling conventions choose for handling DF, and I don't see why it wouldn't work well in 16-bit code, too.