
Lanczos SSE/AVX implementation


Does anybody have any tips on how to implement Lanczos image resampling (upscaling and downscaling) algorithm using SSE/AVX (either intrinsic functions or assembly)?

I looked on some C implementations but there is lot of branching and I don't really know, how to implement it smartly using SSE/AVX.

Example - normalized cardinal sin:

// C implementation
if (x != 0)
  return sin(x*M_PI)/(x*M_PI);
else
  return 1;

// AVX implementation
VXORPD ymm0, ymm0, ymm0
VMOVAPD ymm1, [x]          // x - array of double
VCMPPD ymm0, ymm0, ymm1, 0 // if (!x) - AVX compares take an extra source and a predicate immediate
// what now?

VMOVAPD ymm3, [pi]         // pi - array of doubles = M_PI - is there a better way?
VMULPD ymm1, ymm1, ymm3    // ymm1 = x*pi
(VSINPD ymm2, ymm1)        // no such instruction; found intrinsic _mm256_sin_pd - Intel math library, intrinsic functions are OK with me
VDIVPD ymm2, ymm2, ymm1    // ymm2 = sin(x*pi) / (x*pi)

How should I return 1 for values where x == 0? At those indexes, I have 11...11 (all-ones, true) after the compare.

Also, I am doing this for a greyscale, 8-bit picture, so one pixel is only (0..255). What effect on quality would using float instead of double have? And would it be possible to work with uint8 the whole time and not convert to real numbers at all (the error would probably be substantial)?


Solution

  • If you don't already know asm or SSE/AVX, learning one at a time may be easier. Writing vector algorithms with C/C++ intrinsics will give you a more portable implementation than using asm directly. (Compile for 32 vs. 64-bit, and Windows or everything else, instead of needing 2 or 4 different asm versions, or #ifdef-equivalent macros in your asm.)

    Looking at the compiler output as you write the C can be helpful, to make sure the loads/stores are happening the way you expect, and the compiler isn't doing anything stupid with bloated code due to aliasing / alignment (lack of) assumptions, or storing / generating constants.

    Debugging vector code is hard enough (since there's a lot more state to keep track of, and you have to mentally follow things through shuffles).

    I'd start with finding some parts of the C that might vectorize, if the compiler isn't auto-vectorizing already, and using intrinsics in the C. Then once that was working, I might take the compiler output and hand-tune it in the places where the compiler didn't make optimal code. (See http://agner.org/optimize/)
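    As a sketch of that workflow (the function names here are made up for illustration), you might keep a scalar loop next to your intrinsics version so you can diff their compiler output and results:

    ```c
    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Scalar reference loop: a compiler may already auto-vectorize this. */
    static void scale_scalar(float *dst, const float *src, float w, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * w;
    }

    /* Hand-written intrinsics version of the same loop, 4 floats per step.
     * Unaligned loads/stores avoid alignment assumptions; check the asm to
     * see whether the compiler keeps the constant in a register. */
    static void scale_sse(float *dst, const float *src, float w, int n) {
        __m128 vw = _mm_set1_ps(w);
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 v = _mm_loadu_ps(src + i);
            _mm_storeu_ps(dst + i, _mm_mul_ps(v, vw));
        }
        for (; i < n; i++)          /* scalar tail for n not a multiple of 4 */
            dst[i] = src[i] * w;
    }
    ```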


    As far as processing image data as float vs. int: if you can get away with 16-bit fixed-point, that'll be faster (unless it requires more instructions). See my answer on another image-processing question about using floats vs. fixed-point.
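    A sketch of the 16-bit fixed-point idea (the Q15 scale and the rounding bias are my assumptions, not from the answer): store each Lanczos weight as round(w * 32768), multiply by the 8-bit pixel in 32-bit, and shift back with rounding. The same arithmetic maps onto SSE2's pmaddwd / packuswb several elements at a time.

    ```c
    #include <stdint.h>

    /* Hypothetical Q15 weighted sum: pixels in 0..255, weights as signed
     * Q15 (value * 32768).  Weights should sum to ~32768 for a resampling
     * filter tap set. */
    static uint8_t weighted_sum_q15(const uint8_t *px, const int16_t *w_q15, int n) {
        int32_t acc = 1 << 14;               /* +0.5 in Q15, for rounding */
        for (int i = 0; i < n; i++)
            acc += (int32_t)px[i] * w_q15[i];
        int32_t v = acc >> 15;               /* back to pixel scale */
        if (v < 0)   v = 0;                  /* clamp, like packuswb does */
        if (v > 255) v = 255;
        return (uint8_t)v;
    }
    ```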


    The only math instruction in SSE (beyond the basic add/sub/mul/div) is sqrt. Trig / log / exp are all library functions. Notice that in Intel's intrinsic guide, the "instruction" field is blank, meaning it maps to multiple instructions. Only Intel's compiler even provides these composite intrinsics.

    Anyway, you'll need to either find a sin implementation to inline, or save some registers and make a function call. Depending on the ABI (windows or everything else), some or all xmm registers can be clobbered by functions. Using a specific sin implementation would let you know which registers it needs, and only spill those. (Since you're programming in asm, you can make functions which know more about each other than if they just followed the ABI.)

    If you only ever need to calculate sin(x*PI), you could make a custom sin function that does that, saving the trouble of pre-multiplying by PI. Since an ideal implementation of sin chooses what algorithm to use based on the range of the input, you probably can't get a vectorized implementation that's accurate to the last bit of the mantissa. Fortunately, you probably don't need that, so google up an SSE sin(x) implementation.
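    For reference, here is the scalar branch the question is vectorizing, written correctly (this just calls libm's sinf; a real custom sin(x*PI) would replace that call with a range-reduced polynomial):

    ```c
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Normalized sinc: sin(pi*x)/(pi*x), with the x == 0 case handled by
     * a branch -- exactly the branch the SIMD version has to turn into a
     * compare + mask/blend. */
    static float sinc(float x) {
        if (x == 0.0f)
            return 1.0f;
        float px = (float)M_PI * x;
        return sinf(px) / px;
    }
    ```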


    Conditionals in SIMD vectors are handled by compares making a vector of elements that are either all-zero or all-one. You can then AND or OR those onto other vectors. It works well for things like adding where the identity value is 0. (x + 0 = x, so you can filter out some elements from a vector before adding the vector to an accumulator). If you need to select between two source elements based on a vector of 0 / -1, you can AND/OR things together, or do the same job faster with a blendvps (variable blend packed single-precision floats, rather than a compile-time-constant blend).

    This idea breaks down a bit if you want to avoid calculating a slow divide-by-zero in the first place, rather than the usual just doing the calculation for everything and then masking/blending. Since you want the result to come out to 1 when x == 0.0, your best bet might be to set zero-elements of x to FLT_MIN * 16 or something before computing any of sin(x*PI)/(x*PI). That way, you avoid divide by zero, and the result of the division comes out very close to 1. If you need it to come out to exactly 1.0f (and there isn't a value of x that makes sin(x*PI) == x*PI with your sin implementation) then you'd need to blend twice: once in the numerator and once in the denominator. (To set them both to the same non-zero value).

    vxorps     xmm15, xmm15, xmm15   ; if you can spare a reg to hold a zero constant

    ; inside your loop:  xmm0 holds { x3, x2, x1, x0 }.
    vcmpeqps   xmm1, xmm0, xmm15     ;; mnemonic for vcmpps xmm1, xmm0, xmm15, 0.
                                     ;; Different predicates are an immediate operand, not different opcodes

    vblendvps  xmm2, xmm0, [memory_holding_vector_of_float_min], xmm1  ; Or cache it in a reg if you have one to spare
         ; blendv takes elements from the 2nd source operand when the selector (xmm1) has a 1-bit in the MSB (sign bit)

    ; xmm2 = (x==0.0f) ? FLT_MIN : x
    ; then compute sin(xmm2 * pi) and divide as usual

    Note that cmpps has a wider choice of predicate in the AVX VEX-coded version than in the SSE version.
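    The same compare-and-select idea with intrinsics, using only SSE2 (AND/ANDNOT/OR instead of blendvps, which needs SSE4.1). This variant takes the other route described above: compute sin(x*pi)/(x*pi) for every lane, letting the x == 0 lanes produce NaN, then use the compare mask to force those lanes to exactly 1.0f. The scalar sinf loop is just a placeholder to keep the sketch self-contained; a real version would use a vectorized sin (e.g. Intel's SVML intrinsics).

    ```c
    #include <emmintrin.h>
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* sinc on 4 floats at once.  The divide-by-zero result is computed and
     * then thrown away by the final select, as described above. */
    static __m128 sinc_ps(__m128 x) {
        __m128 zero    = _mm_setzero_ps();
        __m128 is_zero = _mm_cmpeq_ps(x, zero);     /* all-ones where x == 0 */

        __m128 px = _mm_mul_ps(x, _mm_set1_ps((float)M_PI));

        float tmp[4];                               /* scalar stand-in for a */
        _mm_storeu_ps(tmp, px);                     /* vectorized sin        */
        for (int i = 0; i < 4; i++)
            tmp[i] = sinf(tmp[i]);
        __m128 s = _mm_loadu_ps(tmp);

        __m128 q   = _mm_div_ps(s, px);             /* NaN where x == 0 */
        __m128 one = _mm_set1_ps(1.0f);
        /* select: (is_zero & 1.0f) | (~is_zero & q) */
        return _mm_or_ps(_mm_and_ps(is_zero, one),
                         _mm_andnot_ps(is_zero, q));
    }
    ```

    With SSE4.1 the last three intrinsics collapse into a single `_mm_blendv_ps(q, one, is_zero)`.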