I have this problem where I need to calculate the function f(x) = 2*(x^2) + 5 with the MMX instruction set. I have two problems. This is my code so far:
section .data
print_fmt db '%d', 10, 0
my_loaded_data times 128 dw 0
fives times 4 dw 5
twos times 4 dw 2
my_loaded_data_file_name db 'test_numbers.bin', 0
mod_f db 'r', 0
section .text
extern printf, fopen, fread
global main
main:
PUSH rbp
mov rbp, rsp
mov rax, 0
mov rdi, my_loaded_data_file_name
mov rsi, mod_f
call fopen
cmp rax, 0
je .end
PUSH rax
PUSH rdi
PUSH rsi
PUSH rdx
PUSH rcx
mov rdi, my_loaded_data
mov rsi, 2
mov rdx, 128
mov rcx, rax
mov rax, 0
call fread
POP rcx
POP rdx
POP rsi
POP rdi
POP rax
mov rsi, my_loaded_data
mov rcx, 32
jmp .square_loop
.square_loop:
movq mm0, [rsi]
movq mm1, [rsi]
pmulhw mm0, mm1
movq [rsi], mm0
add rsi, 8
loop .square_loop
mov rcx, 32
mov rsi, my_loaded_data
movq mm1, [twos]
jmp .mult_with2_loop
.mult_with2_loop:
movq mm0, [rsi]
pmulhw mm0, mm1
movq [rsi], mm0
add rsi, 8
loop .mult_with2_loop
mov rcx, 32
mov rsi, my_loaded_data
movq mm1, [fives]
jmp .add_five_loop
.add_five_loop:
movq mm0, [rsi]
paddusw mm0, mm1
movq [rsi], mm0
add rsi, 8
loop .add_five_loop
jmp .write_it
.write_it:
mov r8, my_loaded_data
mov rcx, 128
.write_loop:
mov rax, 0
mov ax, [r8]
PUSH r8
PUSH rcx
PUSH rdi
PUSH rsi
PUSH rax
mov rdi, print_fmt
mov rsi, rax
mov rax, 0
call printf
POP rax
POP rsi
POP rdi
POP rcx
POP r8
add r8, 2
loop .write_loop
.end:
mov rax, 0
POP rbp
ret
My first problem is the multiplication instruction. Which instruction do I use for saturation? First I thought there would be an instruction like pmulsw, but it seems there is not. pmulhw saves the upper 16 bits of a 32-bit result. I can't find any instruction that would give a 16-bit result. Is keeping 32-bit results the only way?
My second problem is with printf. It keeps giving a segmentation fault and I don't know why. This is from my terminal:
Program received signal SIGSEGV, Segmentation fault.
__GI___tcgetattr (fd=1, termios_p=termios_p@entry=0x7ffffffed9a8) at ../sysdeps/unix/sysv/linux/tcgetattr.c:42
42 ../sysdeps/unix/sysv/linux/tcgetattr.c: No such file or directory.
Here is the makefile:
zad7.o: zad7.asm
	nasm -f elf64 -g -F dwarf zad7.asm -o zad7.o
zad7: zad7.o
	gcc -o zad7 zad7.o -no-pie -ggdb
And for your convenience here is a little C program which can generate the binary file for reading:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void make_data_file(unsigned int number_of_data, unsigned int size_in_bytes, char* file_name)
{
    FILE *write_ptr = fopen(file_name, "wb");
    if(write_ptr==NULL)
    {
        printf("Error creating file '%s'.\n", file_name);
        return;
    }
    double n_bits = size_in_bytes * 8;
    unsigned int max_num = pow(2, n_bits);
    unsigned short random_number;
    for(int i=0; i< number_of_data; i++)
    {
        random_number = i;
        fwrite(&random_number, size_in_bytes, 1, write_ptr);
    }
    fclose(write_ptr);
}

int main()
{
    make_data_file(128, 2, "test_numbers.bin");
    return 0;
}
If you care about performance, avoid the loop instruction on modern CPUs (see: Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). Also use SSE2 instead of MMX; your array size is a multiple of 16 as well as 8, and you're using x86-64, which is guaranteed to have SSE2. MMX is totally pointless for this unless you're also making a 32-bit version for Pentium III / Athlon XP and earlier.
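For example (an illustrative fragment, not a drop-in patch for your code), the loop at the bottom of your .square_loop can be replaced with an explicit decrement-and-branch:

dec rcx ; count down the element counter
jnz .square_loop ; keep looping while rcx != 0; decodes to cheap uops, unlike loop

Or better, compare a pointer against an end pointer, as the loops further down in this answer do.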
(All the code in my answer will work with 8-byte MMX registers instead of 16-byte XMM registers, because there are MMX versions of all the instructions I used. According to Appendix B of the NASM manual, pmullw, pxor, pcmpgtw, and paddusw were all available in the original P5 Pentium MMX. Some of the instructions that Intel's manual lists as "MMX" (like pmulhuw and pshufw) were only added later, maybe with Pentium II, or along with SSE in Pentium III, but that's not the case for any of the instructions that were useful here.)
See https://stackoverflow.com/tags/x86/info for performance / optimization guides, and also ABI / calling convention links that explain the 16-byte stack alignment required for calling functions.
mov rax, 0 / mov ax, [r8] is really silly. Use movzx eax, word [r8] like a normal person. You also don't need to jmp to the next source line, like jmp .square_loop followed immediately by .square_loop: (execution always falls through to the next line on its own if there isn't a branch instruction).
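Putting those fixes together with the 16-byte stack-alignment rule mentioned above (which is most likely why printf crashes: an odd number of 8-byte pushes before the call leaves rsp misaligned when glibc uses aligned SSE stores), the print loop could look something like this minimal sketch. It keeps your register choices and labels, and assumes rsp is 16-byte aligned at the top of the loop, which it is after main's push rbp:

.write_loop:
push r8 ; printf may clobber r8 and rcx, so save them;
push rcx ; 2 pushes = 16 bytes, so rsp stays 16-byte aligned at the call
mov rdi, print_fmt ; format string
movzx esi, word [r8] ; zero-extend one 16-bit element for %d
xor eax, eax ; no vector args to the variadic call
call printf
pop rcx
pop r8
add r8, 2 ; next 16-bit element
dec rcx
jnz .write_loop ; branch instead of the slow loop instruction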
x86 SIMD doesn't have a saturating multiply, only saturating signed/unsigned addition and saturating packing to narrower elements (MMX/SSE2 paddsw / paddusw). Since you print with %d, maybe you want signed saturation? But that's only after unpacking to 32-bit, and your formula will always produce a positive result, so you could use unsigned saturation. I see that's what your code is already using, with paddusw.
Also, using 3 separate loops that store/reload your data between each step of the formula is really terrible. You (almost) always want to increase computational intensity (the amount of ALU work done on your data per unit of memory/cache bandwidth). You also don't need a multiply instruction to double a number: just add it to itself. padd* runs on more ports than pmul*, and has better latency and (relevant in this case) throughput.
default rel
;;; Efficient but *doesn't* saturate the multiply
lea rcx, [rsi + length] ; rcx = end_pointer
movdqa xmm5, [fives]
.loop: ; do{
movdqu xmm0, [rsi] ; use movdqa if your data is aligned, faster on very old CPUs.
pmullw xmm0, xmm0 ; x*x ; does NOT saturate. will wrap.
paddusw xmm0, xmm0 ; *= 2 ; with unsigned saturation
paddusw xmm0, xmm5 ; += 5 ; xmm5 = _mm_set1_epi16(5) outside the loop
movdqu [rsi], xmm0
add rsi, 16
cmp rsi, rcx ; }while(p<endp);
jb .loop
...
section .rodata
align 16
fives: times 8 dw 5
For saturation, you might be able to use SSSE3 pmaddubsw (https://www.felixcloutier.com/x86/pmaddubsw), but that only takes byte inputs. It saturates the horizontal sum of pairs of i8 x u8 => i16 products.
Otherwise you probably have to unpack to dwords and packssdw (signed) or packusdw (unsigned saturation) back to words. But dword multiply is slow with SSE4.1 pmulld (2 uops on Haswell and later; it's actually only 1 uop on some older CPUs, though). And of course unpacking creates twice as much work from having wider elements.
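If you did go the unpack route, it could look something like the following sketch. This is my own illustrative fragment, not part of the loops below: it needs SSE4.1 (pmulld, packusdw), assumes the inputs fit in 15 bits so the 32-bit intermediate 2*x^2 + 5 can't go negative as a signed dword, and assumes a 16-byte-aligned constant fives_dw: times 4 dd 5.

pxor xmm7, xmm7 ; zeros, for zero-extending words to dwords
movdqa xmm6, [fives_dw] ; set1_epi32(5)
movdqu xmm0, [rsi]
movdqa xmm1, xmm0
punpcklwd xmm0, xmm7 ; low 4 words -> dwords
punpckhwd xmm1, xmm7 ; high 4 words -> dwords
pmulld xmm0, xmm0 ; x*x as 32-bit (SSE4.1)
pmulld xmm1, xmm1
paddd xmm0, xmm0 ; *= 2
paddd xmm1, xmm1
paddd xmm0, xmm6 ; += 5
paddd xmm1, xmm6
packusdw xmm0, xmm1 ; back to 8 words with unsigned saturation (SSE4.1)
movdqu [rsi], xmm0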
In this case, your formula is monotonic with the magnitude of the input, so you can just compare on the input and saturate manually.
If we assume your inputs are also unsigned, we don't have to do an absolute value before comparing. But (until AVX512) we don't have unsigned integer compares, only signed greater-than, so large unsigned inputs will compare as negative.
- 2 * 0x00b5^2 + 5 = 0xfff7, which fits in 16 bits
- 2 * 0x00b6^2 + 5 = 0x102cd, which doesn't, and we want it saturated to 0xffff
The overflow cutoff point is an even number, so we could deal with the signed-compare issue by right shifting. That would be unsigned division by 2, letting us safely treat the result as a non-negative signed integer: 0xb6 >> 1 = 0x5b. But pcmpgtw is a compare for >, not >=.
If we reverse the operands to pcmpgtw to compare for (x>>1) < (0xb6>>1), then we'd have to movdqa the constant to avoid destroying it, but we still need to right-shift x with a movdqa + psrlw. And it's more efficient to have a vector that's 0xffff when saturation is needed (else 0), because we can apply that directly with a POR or PADDUSW.
So our best bet is simply to range-shift both x and 0xb5 to signed, and do (x - 0x8000) > (0xb5 - 0x8000) using the pcmpgtw SIMD signed compare.
Other worse options include:

- pmulhuw to calculate the high half (and check if it's non-zero). We'd be in danger of bottlenecking on multiply throughput, and checking for non-zero with pcmpeqw is the inverse of the condition we want.
- psubusw x, 0xb5 and check that the result is == 0. pcmpeqw would give us an inverted mask, but we can't use pcmpgtw to check for usat16(x-0xb5) > 0, because large inputs (with the high bit set) will stay "negative" after subtracting only a small number like 0xb5.
- paddusw and check for == 0xffff: only small-enough inputs would leave the final result not saturated. Some CPUs can run pxor on more ports than padd*, and it doesn't require any fewer non-zero vector constants, so this is not better in any way. But it is equally good on Skylake.

default rel
;;; With a check on the input to saturate the output
lea rcx, [rsi + length] ; rcx = end_pointer
movdqa xmm4, [input_saturation_cutoff]
movdqa xmm5, [fives]
pcmpeqd xmm3, xmm3
psllw xmm3, 15 ; xmm3 = set1_epi16(0x8000) for range-shifting to signed
.loop:
movdqu xmm0, [rsi]
movdqa xmm1, xmm0
; if x>0xb5 (unsigned), saturate output to 0xffff
pxor xmm1, xmm3 ; x - 0x8000 = range shift to signed for compare
pcmpgtw xmm1, xmm4 ; xmm1 = (x > 0xb5) ? -1 : 0
pmullw xmm0, xmm0 ; x*x
paddusw xmm0, xmm0 ; *= 2 ; saturating add or not doesn't matter here
por xmm1, xmm5 ; 0xffff (saturation needed) else 5. Off the critical path to give OoO exec an easier time.
paddusw xmm0, xmm1 ; += 5 or += 0xffff with unsigned saturation.
movdqu [rsi], xmm0
add rsi, 16
cmp rsi, rcx
jb .loop
...
section .rodata
align 16
input_saturation_cutoff: times 8 dw (0x00b5 - 0x8000) ; range-shifted to signed for pcmpgtw
fives: times 8 dw 5
; 5 = 0xb5 >> 5 or 0xb6 >> 5, but we'd need to knock off the high bit from input_saturation_cutoff
; Or we could materialize constants like this:
; movdqa xmm4, [set1w_0xb5]
; pcmpeqd xmm3, xmm3
; psllw xmm3, 15 ; rangeshift_mask = set1(0x8000)
; movdqa xmm5, xmm4
; psrlw xmm5, 5 ; fives = set1(5)
; pxor xmm4, xmm3 ; input_sat_cutoff = set1(0x80b5)
;; probably not worth it since we already need to load 1 from memory.
I tested this, and paddusw does do 0x2 + 0xffff = 0xffff, for example.
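A quick way to reproduce that check is a standalone fragment like this (hypothetical; twos_w is an assumed 16-byte-aligned times 8 dw 2 constant):

pcmpeqd xmm0, xmm0 ; all-ones = 0xffff in every word
movdqa xmm1, [twos_w] ; set1_epi16(2)
paddusw xmm1, xmm0 ; unsigned saturating add: every word is 0xffff, not the wrapped 0x0001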
We could just POR the final result with the 0 or 0xffff to either leave it unmodified or set it to 0xffff, but modifying the input to the last paddusw
creates more instruction-level parallelism within one iteration. So Out-of-Order Execution doesn't have to overlap as many independent iterations to hide the latency of the loop body. (If we were actually scheduling this for in-order Atom or P5 Pentium-MMX, we'd interleave more of the two dep chains.)
Actually, right-shifting by 1 would work: we only need the compare to catch inputs so large that the multiply wraps to a small result. 0xb6 * 0xb6 doesn't wrap, so it saturates on its own just fine from paddusw.
It's fine if we check (x>>1) > (0xb6>>1) with pcmpgtw (instead of >=) to catch inputs like 0xffff (where pmullw with 0xffff gives us 0x0001). So we could save 1 vector constant, but otherwise this is not better.
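As a sketch of that variant (my own illustrative fragment; halved_cutoff is an assumed 16-byte-aligned constant, times 8 dw 0x5b, i.e. 0xb6 >> 1):

movdqu xmm0, [rsi]
movdqa xmm1, xmm0
psrlw xmm1, 1 ; x>>1 is always non-negative as a signed word
pcmpgtw xmm1, [halved_cutoff] ; -1 where (x>>1) > 0x5b, i.e. x >= 0xb8
; ... then pmullw / paddusw / por / paddusw exactly as in the loop above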
pxor + pcmpgtw is at least as cheap as psrlw xmm, 1 + pcmpgtw, unless maybe we're tuning for Intel P6-family (Core 2 / Nehalem) and running into register-read-port stalls. But that's unlikely: xmm0, xmm1, and rsi should always be hot (recently written and thus read from the ROB, not the permanent register file). We only read 2 vector constants in the first group of 4 instructions in the loop, then 1 later.
As I say below, on many Intel CPUs, psrlw can only run on the same port as pmullw, on the vec-int shift + multiply execution unit. Probably not a throughput bottleneck here, though.
But pcmp and padd run on limited ports (on Intel before Skylake), while pxor can run on any vector ALU port. A mix of purely padd / pcmp / pmul / psrl uops would leave one vector ALU port unused.
(I wrote this part forgetting about the *2 in the formula when I looked for the highest input that wouldn't overflow.)
If the formula had been (0x00ff)^2 + 5, the saturation check would be simpler. We could just check bit positions:
- (0x00ff)^2 + 5 = 0xfe06, which fits in 16 bits
- (0x0100)^2 + 5 = 0x10005, which doesn't, and we want it saturated to 0xffff
So we need to check that the high byte of each word is all zero, or that x & 0xFF == x, or that x >> 8 == 0.
This needs fewer constants, but is actually worse than range-shifting to signed with PXOR, because on some CPUs the vector-shift and vector-multiply execution units are on the same port. (And thus psrlw and pmullw compete with each other for throughput. This is enough total uops that we wouldn't actually bottleneck on port 0 on Nehalem / Sandybridge / Haswell, but it doesn't hurt.)
lea rcx, [rsi + length] ; rcx = end_pointer
movq xmm5, [fives]
punpcklqdq xmm5, xmm5 ; or with SSE3, movddup xmm5, [fives] to broadcast-load
pxor xmm4, xmm4 ; xmm4 = 0
.loop:
movdqu xmm0, [rsi]
movdqa xmm1, xmm0
; if x>0xffU, i.e. if the high byte >0, saturate output to 0xffff
psrlw xmm1, 8 ; x>>8 (logical right shift)
pcmpgtw xmm1, xmm4 ; xmm1 = ((x>>8) > 0) ? -1 : 0
pmullw xmm0, xmm0 ; x*x ; does NOT saturate. will wrap.
paddusw xmm0, xmm0 ; *= 2 ; with unsigned saturation
por xmm1, xmm5 ; 0xffff (saturation needed) or 5 (normal). Off the critical path to give OoO exec an easier time.
paddusw xmm0, xmm1 ; += 5 or += 0xffff with unsigned saturation.
movdqu [rsi], xmm0
add rsi, 16
cmp rsi, rcx
jb .loop
We can finally do unsigned integer compares with AVX512F, and on word element size with AVX512BW. But the result is in a mask register instead of a vector, so we can't just vpor it with the vector of set1(5) to create an input for the saturating add. Instead, we can blend between a vector of 5 and 0xffff, according to the compare mask.
;; AVX512BW version
;; On a Skylake Xeon you probably only want to use YMM regs unless you spend a lot of time in this
;; to avoid reducing max turbo much.
;; Even with xmm or ymm regs (AVX512VL + BW), this demonstrates
;; that we gain even more efficiency than just widening the vectors
;; Just having non-destructive AVX encodings saves the `movdqa xmm1,xmm0` in the SSE2 version.
;; With YMM or XMM regs, most of these instructions can still use shorter VEX encoding (AVX), not the longer EVEX (AVX512)
;; (Use vmovdqa/u instead of vmovdqu64. The 64 is element-size, irrelevant when not masking.)
;;;;;;;;;;; UNTESTED ;;;;;;;;;;;;;;;;;
mov eax, 0xb5 ;; materialize vector constants from an immediate
vpbroadcastw zmm4, eax ; set1_epi16(0xb5): largest input that won't overflow/saturate. (GPR-source word broadcast needs AVX512BW, which we already require.)
vpsrlw zmm5, zmm4, 5 ; fives = set1_epi16(0xb5 >> 5) = set1_epi16(5)
;vpcmpeqd xmm3, xmm3 ; all-ones: result on saturation
vpternlogd zmm3,zmm3,zmm3, 0xff ; alternative for ZMM regs, where there's no compare-into-vector
.loop:
; alignment recommended for 512-bit vectors, but `u` lets it still work (slower) on unaligned.
vmovdqu64 zmm0, [rsi]
;; if x>0xb5 (unsigned), saturate output to 0xffff
vpcmpuw k1, zmm0, zmm4, 2 ; k1 = x <= 0xb5. 2 = LE predicate
; k1 set for elements that WON'T overflow
vpmullw zmm0, zmm0 ; x*x
vpaddusw zmm0, zmm0 ; *= 2 ; saturating add or not doesn't matter here
vmovdqa64 zmm1, zmm3 ; 0xffff
vpaddusw zmm1{k1}, zmm0, zmm5 ; 0xffff or 2*x^2 + 5 merge masking
vmovdqu64 [rsi], zmm1
add rsi, 64
cmp rsi, rcx
jb .loop
(NASM allows vpmullw a, b as a shortcut for vpmullw a, a, b, and likewise vpaddusw a, b for vpaddusw a, a, b, when you don't want to take advantage of the non-destructive-destination 3-operand encoding, like it does for imul eax, 123.)
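For instance (illustrative only), these two lines assemble to the same instruction:

vpaddusw zmm0, zmm5 ; NASM 2-operand shorthand
vpaddusw zmm0, zmm0, zmm5 ; explicit 3-operand form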
An earlier idea for applying the saturation was to use vpblendmw to select between vectors of 5 and 0xffff according to a mask.
vpcmpuw k1, xmm4, xmm0, 1 ; k1 = 0xb5<x = x>0xb5. 1 = LT predicate numerically because NASM doesn't seem to accept vpcmpltuw the way it accepts vpcmpeqb
; k1 = 1 for elements that WILL overflow.
; ... multiply and add as usual ...
vpblendmw xmm1{k1}, xmm5, xmm3 ; select (k1==0) ? 5 : 0xffff
vpaddusw xmm0, xmm1 ; += 5 or += 0xffff with unsigned saturation.
Copying a register still takes a front-end uop, but not a back-end ALU uop. So (especially for 512-bit registers, where port 1 shuts down for vector uops on SKX), this vpblendmw idea is worse than copy-and-merge-mask.
Besides that, IACA thinks vpblendmw xmm1{k1}, xmm3, xmm5 has an output dependency on XMM1, even though it's actually write-only. (I tested by putting 8 of it in a loop, with/without a dep-breaking vpxor.) Mask-blend instructions are a special case: unset mask bits mean it copies from src1 (or zero for zero-masking), while set mask bits mean it copies from src2.
But the machine encoding uses merge-masking, so it's possible the HW would treat it like any other ALU operation with merge-masking, where the output vector is a 3rd input dependency (e.g. vpaddw xmm1{k1}, xmm2, xmm3: if the mask has any zeros, the result in XMM1 keeps the old value of those elements).
This is probably not a problem: according to IACA, SKX can run this at one iteration per 2.24 cycles (bottlenecked on the front-end), so a loop-carried dep chain through XMM1 isn't a problem as long as it's only 1 cycle of latency. (If unrolling to reduce loop overhead / front-end bottlenecks, you should use a separate vector for the blend destinations to decouple iterations, even though you can't get it down anywhere near 1 cycle per iteration.)
(And the version using copy + merge-masking into a vector of 0xffff also runs at that throughput, even for ZMM vectors. But IACA thinks the vpblendmw version would be slower with ZMM, even though it says both bottleneck on the front-end...)