I have this problem where I need to calculate the function f(x) = 2*(x^2) + 5 with the MMX instruction set. I have two problems. This is my code so far:
section .data
print_fmt db '%d', 10, 0
my_loaded_data times 128 dw 0
fives times 4 dw 5
twos times 4 dw 2
my_loaded_data_file_name db 'test_numbers.bin', 0
mod_f db 'r', 0
section .text
extern printf, fopen, fread
global main
main:
PUSH rbp
mov rbp, rsp
mov rax, 0
mov rdi, my_loaded_data_file_name
mov rsi, mod_f
call fopen
cmp rax, 0
je .end
PUSH rax
PUSH rdi
PUSH rsi
PUSH rdx
PUSH rcx
mov rdi, my_loaded_data
mov rsi, 2
mov rdx, 128
mov rcx, rax
mov rax, 0
call fread
POP rcx
POP rdx
POP rsi
POP rdi
POP rax
mov rsi, my_loaded_data
mov rcx, 32
jmp .square_loop
.square_loop:
movq mm0, [rsi]
movq mm1, [rsi]
pmulhw mm0, mm1
movq [rsi], mm0
add rsi, 8
loop .square_loop
mov rcx, 32
mov rsi, my_loaded_data
movq mm1, [twos]
jmp .mult_with2_loop
.mult_with2_loop:
movq mm0, [rsi]
pmulhw mm0, mm1
movq [rsi], mm0
add rsi, 8
loop .mult_with2_loop
mov rcx, 32
mov rsi, my_loaded_data
movq mm1, [fives]
jmp .add_five_loop
.add_five_loop:
movq mm0, [rsi]
paddusw mm0, mm1
movq [rsi], mm0
add rsi, 8
loop .add_five_loop
jmp .write_it
.write_it:
mov r8, my_loaded_data
mov rcx, 128
.write_loop:
mov rax, 0
mov ax, [r8]
PUSH r8
PUSH rcx
PUSH rdi
PUSH rsi
PUSH rax
mov rdi, print_fmt
mov rsi, rax
mov rax, 0
call printf
POP rax
POP rsi
POP rdi
POP rcx
POP r8
add r8, 2
loop .write_loop
.end:
mov rax, 0
POP rbp
ret
My first problem is the multiplication instruction. Which instruction do I use for saturation? First I thought there would be an instruction like pmulsw, but it seems there is not. pmulhw saves the upper 16 bits of a 32-bit result. I can't find any instruction that would give a 16-bit result. Is keeping 32-bit results the only way?
My second problem is with printf. It keeps giving a segmentation fault and I don't know why. This is from my terminal:
Program received signal SIGSEGV, Segmentation fault.
__GI___tcgetattr (fd=1, termios_p=termios_p@entry=0x7ffffffed9a8) at ../sysdeps/unix/sysv/linux/tcgetattr.c:42
42 ../sysdeps/unix/sysv/linux/tcgetattr.c: No such file or directory.
Here is the makefile:
zad7.o: zad7.asm
	nasm -f elf64 -g -F dwarf zad7.asm -o zad7.o
zad7: zad7.o
	gcc -o zad7 zad7.o -no-pie -ggdb
And for your convenience here is a little C program which can generate the binary file for reading:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void make_data_file(unsigned int number_of_data, unsigned int size_in_bytes, char* file_name)
{
    FILE *write_ptr = fopen(file_name, "wb");
    if(write_ptr==NULL)
    {
        printf("Error creating file '%s'.\n", file_name);
        return;
    }
    double n_bits = size_in_bytes * 8;
    unsigned int max_num = pow(2, n_bits);
    unsigned short random_number;
    for(int i=0; i< number_of_data; i++)
    {
        random_number = i;
        fwrite(&random_number, size_in_bytes, 1, write_ptr);
    }
    fclose(write_ptr);
}

int main()
{
    make_data_file(128, 2, "test_numbers.bin");
    return 0;
}
If you care about performance, avoid the loop instruction on modern CPUs (see: Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). Also use SSE2 instead of MMX; your array size is a multiple of 16 as well as 8, and you're using x86-64, which is guaranteed to have SSE2. MMX is totally pointless for this unless you're also making a 32-bit version for Pentium III / Athlon XP and earlier.
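For example (an illustrative fragment, not a drop-in patch for your code), the loop at the bottom of your .square_loop can be replaced with an explicit decrement-and-branch:

dec rcx ; count down the element counter
jnz .square_loop ; keep looping while rcx != 0; decodes to cheap uops, unlike loop

Or better, compare a pointer against an end pointer, as the loops further down in this answer do.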
(All the code in my answer will work with 8-byte MMX registers instead of 16-byte XMM registers, because there are MMX versions of all the instructions I used. According to Appendix B of the NASM manual, pmullw, pxor, pcmpgtw, and paddusw were all available in the original P5 Pentium MMX. Some of the instructions that Intel's manual lists as "MMX" (like pmulhuw and pshufw) were only added later, maybe with Pentium II, or along with SSE in Pentium III, but that's not the case for any of the instructions that were useful here.)
See https://stackoverflow.com/tags/x86/info for performance / optimization guides, and also ABI / calling convention links that explain the 16-byte stack alignment required for calling functions.
mov rax, 0 / mov ax, [r8] is really silly. Use movzx eax, word [r8] like a normal person. You also don't need to jmp to the next source line, like jmp .square_loop followed immediately by .square_loop: (execution always falls through to the next line on its own if there isn't a branch instruction).
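Putting those fixes together with the 16-byte stack-alignment rule mentioned above (which is most likely why printf crashes: an odd number of 8-byte pushes before the call leaves rsp misaligned when glibc uses aligned SSE stores), the print loop could look something like this minimal sketch. It keeps your register choices and labels, and assumes rsp is 16-byte aligned at the top of the loop, which it is after main's push rbp:

.write_loop:
push r8 ; printf may clobber r8 and rcx, so save them;
push rcx ; 2 pushes = 16 bytes, so rsp stays 16-byte aligned at the call
mov rdi, print_fmt ; format string
movzx esi, word [r8] ; zero-extend one 16-bit element for %d
xor eax, eax ; no vector args to the variadic call
call printf
pop rcx
pop r8
add r8, 2 ; next 16-bit element
dec rcx
jnz .write_loop ; branch instead of the slow loop instruction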
x86 SIMD doesn't have a saturating multiply, only saturating signed/unsigned addition and saturating packing to narrower elements (MMX/SSE2 paddsw / paddusw). Since you print with %d, maybe you want signed saturation? But that's only after unpacking to 32-bit, and your formula will always produce a positive result, so you could use unsigned saturation. I see that's what your code is already using, with paddusw.
Also, using 3 separate loops that store/reload your data between each step of the formula is really terrible. You (almost) always want to increase computational intensity (the amount of ALU work done on your data per unit of memory/cache bandwidth). You also don't need a multiply instruction to double a number: just add it to itself. padd* runs on more ports than pmul*, and has better latency and (relevant in this case) throughput.
default rel
;;; Efficient but *doesn't* saturate the multiply
lea rcx, [rsi + length] ; rcx = end_pointer
movdqa xmm5, [fives]
.loop: ; do{
movdqu xmm0, [rsi] ; use movdqa if your data is aligned, faster on very old CPUs.
pmullw xmm0, xmm0 ; x*x ; does NOT saturate. will wrap.
paddusw xmm0, xmm0 ; *= 2 ; with unsigned saturation
paddusw xmm0, xmm5 ; += 5 ; xmm5 = _mm_set1_epi16(5) outside the loop
movdqu [rsi], xmm0
add rsi, 16
cmp rsi, rcx ; }while(p<endp);
jb .loop
...
section .rodata
align 16
fives: times 8 dw 5
For saturation, you might be able to use SSSE3 pmaddubsw (https://www.felixcloutier.com/x86/pmaddubsw), but that only takes byte inputs. It saturates the horizontal sum of pairs of i8 x u8 => i16 products.
Otherwise you probably have to unpack to dwords and packssdw (signed) or packusdw (unsigned saturation) back to words. But dword multiply is slow with SSE4.1 pmulld (2 uops on Haswell and later; it's actually only 1 uop on some older CPUs, though). And of course unpacking creates twice as much work from having wider elements.
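If you did go the unpack route, it could look something like the following sketch. This is my own illustrative fragment, not part of the loops below: it needs SSE4.1 (pmulld, packusdw), assumes the inputs fit in 15 bits so the 32-bit intermediate 2*x^2 + 5 can't go negative as a signed dword, and assumes a 16-byte-aligned constant fives_dw: times 4 dd 5.

pxor xmm7, xmm7 ; zeros, for zero-extending words to dwords
movdqa xmm6, [fives_dw] ; set1_epi32(5)
movdqu xmm0, [rsi]
movdqa xmm1, xmm0
punpcklwd xmm0, xmm7 ; low 4 words -> dwords
punpckhwd xmm1, xmm7 ; high 4 words -> dwords
pmulld xmm0, xmm0 ; x*x as 32-bit (SSE4.1)
pmulld xmm1, xmm1
paddd xmm0, xmm0 ; *= 2
paddd xmm1, xmm1
paddd xmm0, xmm6 ; += 5
paddd xmm1, xmm6
packusdw xmm0, xmm1 ; back to 8 words with unsigned saturation (SSE4.1)
movdqu [rsi], xmm0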
In this case, your formula is monotonic with the magnitude of the input, so you can just compare on the input and saturate manually.
If we assume your inputs are also unsigned, we don't have to do an absolute value before comparing. But (until AVX512) we don't have unsigned integer compares, only signed greater-than, so large unsigned inputs will compare as negative.
- 2 * 0x00b5^2 + 5 = 0xfff7, which fits in 16 bits
- 2 * 0x00b6^2 + 5 = 0x102cd, which doesn't, and we want it saturated to 0xffff
The overflow cutoff point is an even number, so we could deal with the signed-compare issue by right shifting. That would be unsigned division by 2, letting us safely treat the result as a non-negative signed integer: 0xb6 >> 1 = 0x5b. But pcmpgtw is a compare for >, not >=.
If we reverse the operands to pcmpgtw to compare for (x>>1) < (0xb6>>1), then we'd have to movdqa the constant to avoid destroying it, but we still need to right-shift x with a movdqa + psrlw. And it's more efficient to have a vector that's 0xffff when saturation is needed (else 0), because we can apply that directly with a POR or PADDUSW.
So our best bet is simply to range-shift both x and 0xb5 to signed, and do (x - 0x8000) > (0xb5 - 0x8000) using the pcmpgtw SIMD signed compare.
Other worse options include:

- pmulhuw to calculate the high half (and check if it's non-zero). We'd be in danger of bottlenecking on multiply throughput, and checking for non-zero with pcmpeqw is the inverse of the condition we want.
- psubusw x, 0xb5 and check that the result is == 0. pcmpeqw would give us an inverted mask, but we can't use pcmpgtw to check for usat16(x-0xb5) > 0, because large inputs (with the high bit set) will stay "negative" after subtracting only a small number like 0xb5.
- paddusw and check for == 0xffff: only small-enough inputs would leave the final result not saturated. Some CPUs can run pxor on more ports than padd*, and it doesn't require any fewer non-zero vector constants, so this is not better in any way. But it is equally good on Skylake.

default rel
;;; With a check on the input to saturate the output
lea rcx, [rsi + length] ; rcx = end_pointer
movdqa xmm4, [input_saturation_cutoff]
movdqa xmm5, [fives]
pcmpeqd xmm3, xmm3
psllw xmm3, 15 ; xmm3 = set1_epi16(0x8000) for range-shifting to signed
.loop:
movdqu xmm0, [rsi]
movdqa xmm1, xmm0
; if x>0xb5 (unsigned), saturate output to 0xffff
pxor xmm1, xmm3 ; x - 0x8000 = range shift to signed for compare
pcmpgtw xmm1, xmm4 ; xmm1 = (x > 0xb5) ? -1 : 0
pmullw xmm0, xmm0 ; x*x
paddusw xmm0, xmm0 ; *= 2 ; saturating add or not doesn't matter here
por xmm1, xmm5 ; 0xffff (saturation needed) else 5. Off the critical path to give OoO exec an easier time.
paddusw xmm0, xmm1 ; += 5 or += 0xffff with unsigned saturation.
movdqu [rsi], xmm0
add rsi, 16
cmp rsi, rcx
jb .loop
...
section .rodata
align 16
input_saturation_cutoff: times 8 dw (0x00b5 - 0x8000) ; range-shifted to signed for pcmpgtw
fives: times 8 dw 5
; 5 = 0xb5 >> 5 or 0xb6 >> 5, but we'd need to knock off the high bit from input_saturation_cutoff
; Or we could materialize constants like this:
; movdqa xmm4, [set1w_0xb5]
; pcmpeqd xmm3, xmm3
; psllw xmm3, 15 ; rangeshift_mask = set1(0x8000)
; movdqa xmm5, xmm4
; psrlw xmm5, 5 ; fives = set1(5)
; pxor xmm4, xmm3 ; input_sat_cutoff = set1(0x80b5)
;; probably not worth it since we already need to load 1 from memory.
I tested this, and paddusw does do 0x2 + 0xffff = 0xffff, for example.
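A quick way to reproduce that check is a standalone fragment like this (hypothetical; twos_w is an assumed 16-byte-aligned times 8 dw 2 constant):

pcmpeqd xmm0, xmm0 ; all-ones = 0xffff in every word
movdqa xmm1, [twos_w] ; set1_epi16(2)
paddusw xmm1, xmm0 ; unsigned saturating add: every word is 0xffff, not the wrapped 0x0001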
We could just POR the final result with the 0 or 0xffff to either leave it unmodified or set it to 0xffff, but modifying the input to the last paddusw
creates more instruction-level parallelism within one iteration. So Out-of-Order Execution doesn't have to overlap as many independent iterations to hide the latency of the loop body. (If we were actually scheduling this for in-order Atom or P5 Pentium-MMX, we'd interleave more of the two dep chains.)
Actually, right-shifting by 1 would work: we only need the compare to catch inputs so large that the multiply wraps to a small result. 0xb6 * 0xb6 doesn't wrap, so it saturates on its own just fine from paddusw.
It's fine if we check (x>>1) > (0xb6>>1) with pcmpgtw (instead of >=) to catch inputs like 0xffff (where pmullw with 0xffff gives us 0x0001). So we could save 1 vector constant, but otherwise this is not better.
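As a sketch of that variant (my own illustrative fragment; halved_cutoff is an assumed 16-byte-aligned constant, times 8 dw 0x5b, i.e. 0xb6 >> 1):

movdqu xmm0, [rsi]
movdqa xmm1, xmm0
psrlw xmm1, 1 ; x>>1 is always non-negative as a signed word
pcmpgtw xmm1, [halved_cutoff] ; -1 where (x>>1) > 0x5b, i.e. x >= 0xb8
; ... then pmullw / paddusw / por / paddusw exactly as in the loop above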
pxor + pcmpgtw is at least as cheap as psrlw xmm, 1 + pcmpgtw, unless maybe we're tuning for Intel P6-family (Core 2 / Nehalem) and running into register-read-port stalls. But that's unlikely: xmm0, xmm1, and rsi should always be hot (recently written and thus read from the ROB, not the permanent register file). We only read 2 vector constants in the first group of 4 instructions in the loop, then 1 later.
As I say below, on many Intel CPUs, psrlw can only run on the same port as pmullw, on the vec-int shift + multiply execution unit. Probably not a throughput bottleneck here, though.
But pcmp and padd run on limited ports (on Intel before Skylake), while pxor can run on any vector ALU port. A mix of purely padd / pcmp / pmul / psrl uops would leave one vector ALU port unused.
(I wrote this part forgetting about the *2 in the formula when I looked for the highest input that wouldn't overflow.)
If the formula had been (0x00ff)^2 + 5, the saturation check would be simpler. We could just check bit positions:
- (0x00ff)^2 + 5 = 0xfe06, which fits in 16 bits
- (0x0100)^2 + 5 = 0x10005, which doesn't, and we want it saturated to 0xffff
So we need to check that the high byte of each word is all zero, or that x & 0xFF == x, or that x >> 8 == 0.
This needs fewer constants, but is actually worse than range-shifting to signed with PXOR, because on some CPUs the vector-shift and vector-multiply execution units are on the same port. (And thus psrlw and pmullw compete with each other for throughput. This is enough total uops that we wouldn't actually bottleneck on port 0 on Nehalem / Sandybridge / Haswell, but it doesn't hurt.)
lea rcx, [rsi + length] ; rcx = end_pointer
movq xmm5, [fives]
punpcklqdq xmm5, xmm5 ; or with SSE3, movddup xmm5, [fives] to broadcast-load
pxor xmm4, xmm4 ; xmm4 = 0
.loop:
movdqu xmm0, [rsi]
movdqa xmm1, xmm0
; if x>0xffU, i.e. if the high byte >0, saturate output to 0xffff
psrlw xmm1, 8 ; x>>8 (logical right shift)
pcmpgtw xmm1, xmm4 ; xmm1 = ((x>>8) > 0) ? -1 : 0
pmullw xmm0, xmm0 ; x*x ; does NOT saturate. will wrap.
paddusw xmm0, xmm0 ; *= 2 ; with unsigned saturation
por xmm1, xmm5 ; 0xffff (saturation needed) or 5 (normal). Off the critical path to give OoO exec an easier time.
paddusw xmm0, xmm1 ; += 5 or += 0xffff with unsigned saturation.
movdqu [rsi], xmm0
add rsi, 16
cmp rsi, rcx
jb .loop
We can finally do unsigned integer compares with AVX512F, and on word element size with AVX512BW. But the result is in a mask register instead of a vector, so we can't just vpor it with the vector of set1(5) to create an input for the saturating add. Instead, we can blend between a vector of 5 and 0xffff, according to the compare mask.
;; AVX512BW version
;; On a Skylake Xeon you probably only want to use YMM regs unless you spend a lot of time in this
;; to avoid reducing max turbo much.
;; Even with xmm or ymm regs (AVX512VL + BW), this demonstrates
;; that we gain even more efficiency than just widening the vectors
;; Just having non-destructive AVX encodings saves the `movdqa xmm1,xmm0` in the SSE2 version.
;; With YMM or XMM regs, most of these instructions can still use shorter VEX encoding (AVX), not the longer EVEX (AVX512)
;; (Use vmovdqa/u instead of vmovdqu64. The 64 is element-size, irrelevant when not masking.)
;;;;;;;;;;; UNTESTED ;;;;;;;;;;;;;;;;;
mov eax, 0xb5 ;; materialize vector constants from an immediate
vpbroadcastw zmm4, eax ; set1_epi16(0xb5): largest input that won't overflow/saturate. (GPR-source word broadcast needs AVX512BW, which we already require.)
vpsrlw zmm5, zmm4, 5 ; fives = set1_epi16(0xb5 >> 5) = set1_epi16(5)
;vpcmpeqd xmm3, xmm3 ; all-ones: result on saturation
vpternlogd zmm3,zmm3,zmm3, 0xff ; alternative for ZMM regs, where there's no compare-into-vector
.loop:
; alignment recommended for 512-bit vectors, but `u` lets it still work (slower) on unaligned.
vmovdqu64 zmm0, [rsi]
;; if x>0xb5 (unsigned), saturate output to 0xffff
vpcmpuw k1, zmm0, zmm4, 2 ; k1 = x <= 0xb5. 2 = LE predicate
; k1 set for elements that WON'T overflow
vpmullw zmm0, zmm0 ; x*x
vpaddusw zmm0, zmm0 ; *= 2 ; saturating add or not doesn't matter here
vmovdqa64 zmm1, zmm3 ; 0xffff
vpaddusw zmm1{k1}, zmm0, zmm5 ; 0xffff or 2*x^2 + 5 merge masking
vmovdqu64 [rsi], zmm1
add rsi, 64
cmp rsi, rcx
jb .loop
(NASM allows vpmullw a, b as a shortcut for vpmullw a, a, b, and likewise vpaddusw a, b for vpaddusw a, a, b, when you don't want to take advantage of the non-destructive-destination 3-operand encoding, like it does for imul eax, 123.)
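For instance (illustrative only), these two lines assemble to the same instruction:

vpaddusw zmm0, zmm5 ; NASM 2-operand shorthand
vpaddusw zmm0, zmm0, zmm5 ; explicit 3-operand form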
An earlier idea for applying the saturation was to use vpblendmw to select between vectors of 5 and 0xffff according to a mask.
vpcmpuw k1, xmm4, xmm0, 1 ; k1 = 0xb5<x = x>0xb5. 1 = LT predicate numerically because NASM doesn't seem to accept vpcmpltuw the way it accepts vpcmpeqb
; k1 = 1 for elements that WILL overflow.
; ... multiply and add as usual ...
vpblendmw xmm1{k1}, xmm5, xmm3 ; select (k1==0) ? 5 : 0xffff
vpaddusw xmm0, xmm1 ; += 5 or += 0xffff with unsigned saturation.
Copying a register still takes a front-end uop, but not a back-end ALU uop. So (especially for 512-bit registers, where port 1 shuts down for vector uops on SKX), this vpblendmw idea is worse than copy-and-merge-mask.
Besides that, IACA thinks vpblendmw xmm1{k1}, xmm3, xmm5 has an output dependency on XMM1, even though it's actually write-only. (I tested by putting 8 of it in a loop, with/without a dep-breaking vpxor.) Mask-blend instructions are a special case: unset mask bits mean it copies from src1 (or zero for zero-masking), while set mask bits mean it copies from src2.
But the machine encoding uses merge-masking, so it's possible the HW would treat it like any other ALU operation with merge-masking, where the output vector is a 3rd input dependency (e.g. vpaddw xmm1{k1}, xmm2, xmm3: if the mask has any zeros, the result in XMM1 keeps the old value of those elements).
This is probably not a problem: according to IACA, SKX can run this at one iteration per 2.24 cycles (bottlenecked on the front-end), so a loop-carried dep chain through XMM1 isn't a problem as long as it's only 1 cycle of latency. (If unrolling to reduce loop overhead / front-end bottlenecks, you should use a separate vector for the blend destinations to decouple iterations, even though you can't get it down anywhere near 1 cycle per iteration.)
(And the version using copy + merge-masking into a vector of 0xffff also runs at that throughput, even for ZMM vectors. But IACA thinks the vpblendmw version would be slower with ZMM, even though it says both bottleneck on the front-end...)