I'm starting to learn assembly, could someone please review my solution?
Exercise :
You are given eax=a, ebx=b.
Calculate (a^3) * b + 5*(b^2), and store the result inside eax.
(Here * means multiplication, and c^d means c to the power of d).
Here's my solution :
;Init the values
mov eax,3
mov ebx,2
;3,2 =>(3^3)*2 + 5*(2^2) = 27*2 + 5*4 = 54+20 = 74
;5,2 =>(5^3)*2 + 5*(2^2) = 125*2 + 5*4 = 250 + 20 = 270
;(a^3)*b
mov esi,eax
mul eax
mul esi
mul ebx
;Store the value
mov esi,eax
;5*(b^2)
mov eax,ebx
mul ebx
mov ebx,5
mul ebx
;Calculate (a^3)*b + 5*(b^2)
lea eax,[esi + eax]
Is there a way to solve this with less instructions?
Since you don't need a double-width output in EDX:EAX, instead just a 32-bit result from your 32-bit inputs, don't use slower and less convenient 1-operand mul
or imul
.
The 2-operand imul r, r
is faster because it doesn't have to compute the upper half of the product, or store it anywhere. It works for signed and unsigned numbers. It looks like you could have save a lot of mov
instructions by using the 2-operand form.
To summarize comments:
Multiplies by small constant factors are often best done with lea
. You can replace up to 4 instructions (including a mov
) with an lea
, since it gives you non-destructive operation. (dest isn't one of the sources).
As spyr03 points out,
a^3*b + 5*b^2 = b*(a^3 + 5*b)
Your final instruction:
lea eax,[esi + eax]
could have been a simple add eax, esi
, which can run on more execution ports than lea
, so it's less likely to part of a bottleneck. Only use lea
if you can't do the equivalent with one other single instruction. (Other than imul
/ mul
. Always replace those with a single lea
if possible. For latency over front-end throughput, sometimes worth using up to two LEAs to replace one imul reg, reg, 37
or whatever.)
So I might do:
mov ecx, eax
imul ecx, eax ; ecx = a^2
imul eax, ecx ; eax = a^3
lea edx, [ebx + 4*ebx] ; edx = 5*b, ebx still = b
add eax, edx ; eax = a^3 + 5*b
imul eax, ebx ; eax = b * (a^3 + 5*b)
Always comment your asm code. I like the saying that asm code can only have 2 kinds of bugs: code doesn't match comments, or comments don't describe a valid algorithm for doing the task.
Latency:
5*b
happens in parallel with a^3
, and is much faster (1 cycle). (I put it after the imul
s for a
to make sure the CPU started working on the longer dep chain ASAP.)
The long dep chain is the one involving eax
.
mov(1 or 0) -> imul(3) -> imul(3) -> add(1) -> imul(3) total = 10 cycles
(On IvyBridge and later, reg-reg moves are done at the register-renaming stage, and have zero latency.)
It's not very many instructions, though, and they're all single-uop instructions, so there's plenty of room for other stuff to happen in parallel with it.
I don't see any scope for shortening the dependency chain even at the cost of more instructions. a^3*b
and 5*b^2
could be calculated in parallel, and added at the end, but that would still be 3 multiplies in the longer of the two chains.