I want to know how setting or clearing the direction flag (DF) in EFLAGS changes how the SCAS and MOVS instructions decrement or increment registers. I read some webpages and made the assumptions listed below.
I am using the MASM32 SDK (no idea what version; I installed it via Visual MASM's download and installation wizard), with Visual MASM to write the programs and the MASM32 Editor to link and build them into objects and executables. I use a Windows 7 Pro 64-bit OS.
The SCAS instruction "compares a byte in AL or a word in AX with a byte or word pointed to by DI in ES." Therefore, to use SCAS, the target string's address must be moved to EDI and the string to find must be moved to the accumulator register (EAX and variants).
Setting direction flag then using SCAS will stop SCAS from running when using 32 bit systems. On 32 bit systems, it is impossible to force SCAS to "scan a string from the end to the start."
Any REP instruction always uses the ECX register as a counter and always decrements ECX regardless of the direction flag's value. This means it is impossible to "scan a string from the end to the beginning" using REP SCAS.
Sources:
SCAS/SCASB/SCASW, Birla Institute of Technology and Science
Scan String, from c9x.me
SCAS/SCASB/SCASW/SCASD — Scan String, from felixcloutier.com
MASM : Using 'String' Instructions, from www.dreamincode.net/forums
Below is part of the code from a program I will refer to in my questions:
;Generic settings from MASM32 editor
.386
.model flat, stdcall
option casemap: none
.data?
Input db 254 dup(?)
InputCopy db 254 dup(?)
InputLength dd ?, 0
InputEnd dd ?, 0
.data
.code
start:
push 254
push offset Input
call StdIn
mov InputLength, eax
;---Move Last Word---
lea esi, offset Input
sub esi, 4
lea edi, offset InputEnd
movsw
;---Search section---
lea esi, Input
lea edi, InputCopy
movsb
mov ecx, InputLength
mov eax, 0
mov eax, "omit"
lea edi, offset InputEnd
repne scasw
jz close ;jump if a match was found and ZF was set to 1.
Using the "Move Last Word" section, I am able to get the last byte out of the string Input. I then used MOVSW to move just the last 4 bytes of the string Input to InputEnd, assuming the direction flag is clear. I must define Input as an array of bytes - `Input db 32 dup(?)` - for the block to work.
Regardless of how I define InputEnd (whether "dd ?, 0" or "db 12 dup(?)"), the operation of the MOVS and SCAS instructions (flags set, registers modified, etc.) will not change. The increment/decrement amount of SCAS and MOVS depends on the suffix (last letter) of the instruction, not on the defined bytes or the size of the pointers stored in EDI and ESI.
It is impossible to make MOVS transfer from the beginning to the end of a string. You must get the length of the string; load the corresponding addresses into EDI and ESI; add the length of the string to the addresses stored in EDI and ESI; last, set the direction flag using `std`. A danger here is targeting addresses below the source or destination bytes.
It is impossible to reverse a string's letters using MOVS since EDI and ESI are either both decremented or both incremented by MOVS.
Sources (aside from the sites previously listed in the SCAS section):
https://c9x.me/x86/html/file_module_x86_id_203.html
http://faydoc.tripod.com/cpu/movsd.htm
Are these assumptions correct? Is the x86 text on the sites' URLs a sign that the websites have wrong information?
First of all, `repe`/`repne scas` and `cmps` aren't fast. Also, the "fast strings" / ERMSB microcode for `rep movs` and `rep stos` is only fast with DF=0 (normal / forward / increasing address). `rep movs` with DF=1 is slow. `repne scasw` is always slow. They can be useful in the rare case where you're optimizing for code-size, though.
The documentation you linked sets out exactly how `movs` and `scas` are affected by DF. Read the Operation section in Intel's manuals.
Note that it's always a post-increment/decrement so the first element compared doesn't depend on DF, only the updates to EDI and/or ESI.
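To make the post-increment/decrement point concrete, here is a minimal sketch in MASM syntax, assuming the same `.model flat, stdcall` setup as the question. `ScasStepDemo` and its register interface are hypothetical names chosen for illustration; they are not part of MASM32.

```asm
; Hypothetical demo: the byte compared by scasb is the same for DF=0 and DF=1;
; only the EDI update afterwards differs.
; In:  EDI = address of some byte, AL = value to compare against it
; Out: EAX = EDI after scasb with DF=0 (input + 1)
;      EDX = EDI after scasb with DF=1 (input - 1)
; Clobbers ECX, EDI and the flags. Leaves DF=0.
ScasStepDemo proc
    mov   edx, edi        ; remember the starting pointer
    cld                   ; DF=0
    scasb                 ; compare AL with [EDI], then EDI += 1
    mov   ecx, edi        ; save the forward result (AL is still an input)
    mov   edi, edx        ; back to the same starting byte
    std                   ; DF=1
    scasb                 ; compare AL with the same byte, then EDI -= 1
    cld                   ; restore DF=0 before returning
    mov   eax, ecx        ; EAX = start + 1
    mov   edx, edi        ; EDX = start - 1
    ret
ScasStepDemo endp
```

Both `scasb` instructions set ZF from the same comparison, because EDI is only updated after the compare.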
Your code only depends on DF for the `repne scasw`. It doesn't matter whether `movsb` increments (DF=0) or decrements (DF=1) EDI because you overwrite EDI before the next use.
`repne scasw` is 16-bit "word" size using AX, like it says in the HTML extracts of Intel's manual that you linked (https://www.felixcloutier.com/x86/scas:scasb:scasw:scasd). That's both the increment and the compare width.
If you want overlapping dword compares of EAX, you can't use `scasw`. You could use `scasd` in a loop, but then you'd have to decrement `edi` to create overlap. So really you should just use a normal `cmp [edi], eax` and `add edi, 2` if you only want to check even positions.
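The plain `cmp` / `add` loop could look something like the sketch below, written in MASM syntax under the question's `.model flat, stdcall` setup. `FindDwordEven`, its labels, and its register interface are hypothetical names used only for this illustration.

```asm
; Hypothetical helper: find the dword in EAX at even byte offsets of a buffer.
; In:  EDI = buffer start, ECX = buffer length in bytes, EAX = 4-byte pattern
; Out: EDI = address of the first match, or 0 if there is none
; Clobbers ECX and the flags. Does not depend on DF.
FindDwordEven proc
    sub   ecx, 4              ; ECX = last offset where 4 bytes still fit
    jb    fde_none            ; buffer is shorter than 4 bytes
    add   ecx, edi            ; ECX = last valid candidate address
fde_check:
    cmp   [edi], eax          ; dword compare; operand size is implied by EAX
    je    fde_done            ; match: EDI already points at it
    add   edi, 2              ; step 2 bytes, so consecutive compares overlap
    cmp   edi, ecx
    jbe   fde_check           ; keep going while 4 bytes still fit
fde_none:
    xor   edi, edi            ; no match: return a null pointer
fde_done:
    ret
FindDwordEven endp
```

A caller would load EDI with `mov edi, OFFSET Input`, ECX with the byte length and EAX with the pattern, then test EDI for zero after the call.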
(Or preferably use SSE2 SIMD `pcmpeqd` to implement `memmem` for a 4-byte search "needle". Look at an optimized implementation like glibc's for ideas, or a strstr implementation but take out the checks for a `0` terminator in the "haystack".)
`repne scasd` does not implement strstr or memmem, it only searches for a single element. With `byte` operand size, it implements `memchr`.
> On 32 bit systems, it is impossible to force SCAS to "scan a string from the end to the start."

`rep scas` doesn't operate on (implicit-length) C-style strings at all; it works on explicit-length strings. Therefore you can just point EDI at the last element of the buffer.
Unlike `strrchr` you don't have to find the end of the string as well as the last match, you know / can calculate where the end of the string is. Perhaps calling them "strings" is the problem; the x86 `rep`-string instructions actually work on known-size buffers. That's why they take a count in ECX and don't also stop on a terminating `0` byte.
Use `lea edi, [buf + ecx - 1]` to set up for `std`; `rep scasb`. Or `lea edi, [buf + ecx*2 - 2]` to set up for backwards `rep scasw` on a buffer with ECX `word` elements. (Generate a pointer to the last element = `buf + size - 1` = `buf-1 + size`)
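As a sketch of the word-sized backward scan described above (MASM syntax, assuming the question's `.model flat, stdcall` setup), the following hypothetical `FindLastWord` procedure takes the buffer address in a register instead of using a static `buf` symbol. The procedure name, labels, and register interface are assumptions made for this example.

```asm
; Hypothetical helper: find the last occurrence of AX in a buffer of words.
; In:  ESI = buffer start, ECX = number of word elements, AX = word to find
; Out: EAX = address of the last match, or 0 if there is none
; Clobbers ECX, EDI and the flags. Leaves DF=0.
FindLastWord proc
    test  ecx, ecx
    jz    flw_none               ; empty buffer: repne would leave ZF unchanged
    lea   edi, [esi + ecx*2 - 2] ; EDI -> last word element
    std                          ; DF=1: scasw decrements EDI by 2
    repne scasw                  ; stop on a match or when ECX reaches 0
    cld                          ; restore DF=0 (cld does not touch ZF)
    jne   flw_none               ; ZF=0: the last compare was not a match
    lea   eax, [edi + 2]         ; post-decrement: EDI is one element below the match
    ret
flw_none:
    xor   eax, eax
    ret
FindLastWord endp
```

The explicit zero-length check matters because `repne scasw` with ECX=0 performs no compare at all, so ZF would still hold whatever an earlier instruction left in it.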
> Any REP instruction always uses the ECX register as a counter and always decrements ECX regardless of the direction flag's value. This means it is impossible to "scan a string from the end to the beginning" using REP SCAS.

This just makes zero sense. Of course it decrements; ECX=0 is how the search ends on no-match. If you want to calculate the position relative to the end after searching from the end, you can do `length - ecx` or something like that. Or do pointer-subtraction on EDI.
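A minimal sketch of that position arithmetic, in MASM syntax under the question's setup; `LastIndexOfByte` and its register interface are hypothetical. After a backward `repne scasb` that stopped on a match after k compares, ECX holds length - k, which is also the 0-based index of the match from the start of the buffer, so the distance from the last element is length - 1 - ECX.

```asm
; Hypothetical helper: locate the last occurrence of AL and report its position.
; In:  EDI = buffer start, ECX = length in bytes, AL = byte to find
; Out: EAX = 0-based index from the start, or -1 if not found
;      EDX = distance in bytes from the last element (valid only on a match)
; Clobbers ECX, EDI and the flags. Leaves DF=0.
LastIndexOfByte proc
    mov   edx, ecx               ; keep the original length
    test  ecx, ecx
    jz    lib_none               ; empty buffer: nothing to compare
    lea   edi, [edi + ecx - 1]   ; EDI -> last byte
    std                          ; scan towards lower addresses
    repne scasb                  ; ECX is decremented once per compare
    cld                          ; restore DF=0 (ZF is unaffected)
    jne   lib_none
    mov   eax, ecx               ; remaining count = index of the match
    sub   edx, ecx               ; length - ECX = compares performed
    dec   edx                    ; minus 1 = distance from the last element
    ret
lib_none:
    or    eax, -1
    ret
LastIndexOfByte endp
```

For example, with a 5-byte buffer whose only match is the last byte, one compare leaves ECX=4, so EAX=4 and EDX=0.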
> 6: not the data type of registers stored in EDI and ESI.

Assembly language doesn't have types; that's a higher level concept. It's up to you to do the right thing to the right bytes in asm. EDI / ESI are registers; the pointers stored in them are just integers that have no type in asm. You don't "store a register in EDI", it is a register. Maybe you meant to say "pointer stored in EDI"? Registers don't have types; a bit-pattern (aka integer) in a register can be signed 2's complement, unsigned, a pointer, or whatever other interpretation you want.
But yes, any magic that MASM does based on how you defined a symbol is completely gone once you have a pointer in a register.
Remember that `movsd` is just a 1-byte instruction in x86 machine code, just the opcode. It has only 3 inputs: DF, and two 32-bit integers in EDI and ESI, and they're all implicit (implied by the opcode byte). There's no other context that can affect what the hardware does. Every machine instruction has its documented effect on the architectural state of the machine; nothing more, nothing less.
> 7: It is impossible to make MOVS transfer from the beginning to the end of a string. ... `std`

No, `std` makes a transfer go backwards, from end to beginning. `DF=0` is the normal / forward direction. Calling conventions guarantee / require that DF=0 on entry and exit from any function so you don't need a `cld` before using string instructions; you can just assume that DF=0. (And you should normally leave DF=0.)
> 8: It is impossible to reverse a string's letters using MOVS since EDI and ESI are either both decremented or both incremented by MOVS.

That's correct. And a `lods` / `std` / `stos` / `cld` loop is not worth it vs. a normal loop that uses `dec` or `sub` on one of the pointers. You can use `lods` for the read part and manually write backwards. And you can go 4x faster by loading a dword and using `bswap` to reverse it in a register, so you're copying in chunks of 4 reversed bytes.
Or for in-place reversal: 2 loads into tmp regs, then 2 stores, then move the pointers towards each other until they cross. (Also works with `bswap` or `movbe`)
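A byte-at-a-time sketch of the "`lods` for the read part, manual backward write" approach, in MASM syntax under the question's `.386` / `.model flat, stdcall` setup. `ReverseCopy` and its register interface are hypothetical names. It deliberately avoids `bswap`, which needs `.486` or later, so it shows the simple version rather than the faster dword one.

```asm
; Hypothetical helper: copy ECX bytes from ESI to EDI in reversed order.
; In:  ESI = source, EDI = destination (must not overlap the source),
;      ECX = length in bytes
; Assumes DF=0 on entry, as the calling conventions guarantee.
; Clobbers EAX, ECX, ESI, EDI and the flags.
ReverseCopy proc
    test  ecx, ecx
    jz    rc_done                ; nothing to copy
    lea   edi, [edi + ecx - 1]   ; EDI -> last byte of the destination
rc_next:
    lodsb                        ; AL = [ESI], then ESI += 1 (DF=0)
    mov   [edi], al              ; store at the current back position
    dec   edi                    ; move the write pointer backwards by hand
    dec   ecx
    jnz   rc_next
rc_done:
    ret
ReverseCopy endp
```

Only the source pointer is advanced by a string instruction here; DF stays clear the whole time, so no `std` / `cld` pair is needed.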
Other weird inefficiencies in your code:
mov eax, 0 ;; completely pointless, EAX is overwritten by next instruction
mov eax, "omit"
Also, `lea` with a `disp32` addressing mode is a pointless waste of code-size. Only use LEA for static addresses in 64-bit code, for RIP-relative addressing. Use `mov esi, OFFSET Input` instead, like you're doing with `push offset Input` earlier.