JPEG-XL: How many VarDCT implementations are there?


I am looking at the following innocent-looking command line:

% cjxl flower.png flower.jxl
[...]
Encoding [Container | VarDCT, d1.000, effort: 7 | 3332-byte XMP], 

With the exact same source code compiled on different architectures, here is what I observe:

Debian arch   JPEG XL encoder v0.7.0      Compressed size
amd64         [AVX2,SSE4,SSSE3,Unknown]   Compressed to 450659 bytes including container (1.051 bpp).
arm64         [NEON]                      Compressed to 450660 bytes including container (1.051 bpp).
armel         [Unknown]                   Compressed to 450828 bytes including container (1.052 bpp).
armhf         [NEON,Unknown]              Compressed to 450664 bytes including container (1.051 bpp).
i386          [SSE4,SSSE3,Unknown]        Compressed to 450678 bytes including container (1.051 bpp).
ppc64el       [Unknown]                   Compressed to 450858 bytes including container (1.052 bpp).

I understand that, for lossy compression, conformance thresholds are defined in terms of RMSE and peak absolute error per sample. These thresholds are nonzero so that implementations can use SIMD in the best possible way, which can lead to very slightly different results (obviously without any visible difference).

However, in this case there seems to be at least one other factor: armel and ppc64el both seem to default to the non-SIMD codepath, yet they produce different outputs.

Hence my question: how many VarDCT implementations are there?


For reference:

% file flower.png
flower.png: PNG image data, 2268 x 1512, 8-bit/color RGB, non-interlaced

Solution

  • I think every combination of (compiler, compiler version, cpu architecture) can lead to potentially (slightly) different results. Something as simple as

    float a = b * c + d;
    

    can end up being compiled to very different instructions, causing slight differences in the end result. For example, a fused multiply-add rounds only once, so it is slightly more precise than doing the multiply first and then the add. On old platforms without hardware floating-point arithmetic the result might be different still, and compilers might reorder or autovectorize operations, which again leads to slightly different results.
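To make the rounding point concrete, here is a minimal sketch in C. The function names are made up for illustration, and the second function only stands in for a fused multiply-add by carrying the product exactly in a double; a real FMA instruction (or `fmaf` from `<math.h>`) rounds once in the same way, and for the inputs discussed below it should give the same value.

```c
/* Illustrative only: two ways the expression b * c + d might be evaluated. */

/* Two roundings: the product is rounded to float, then the sum is rounded. */
float mul_then_add(float b, float c, float d) {
    /* volatile forces the product to be stored as a float, so the compiler
       cannot fuse the multiply and the add into one instruction. */
    volatile float product = b * c;
    return product + d;
}

/* One rounding to float: the product of two floats is exact in a double,
   so only the final conversion back to float rounds for these inputs.
   This is a stand-in for a fused multiply-add, not a general replacement. */
float mul_add_single_rounding(float b, float c, float d) {
    return (float)((double)b * (double)c + (double)d);
}
```

With b = c = 1 + 2^-12 and d = -(1 + 2^-11), the exact answer is 2^-24. The first function returns 0 because the 2^-24 term is lost when the product is rounded to float; the second returns 2^-24. Neither result is wrong as far as the compiler is concerned, which is why the same source can behave differently across builds.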

    The 150-200 byte gap between the builds without SIMD (armel, ppc64el) and the SIMD builds seems a bit much, and it might be worth investigating why it is so large. The difference between armel and ppc64el themselves probably comes down to floating-point arithmetic (Debian's armel is a soft-float port, ppc64el is not), and in general the code created by the compiler will be different since those are not the same platform.

    To be clear: conformance is only about decoding, not encoding. Encoders can do whatever they want as long as they produce a valid bitstream — the conformance specification does not say anything about encoding. The only thing that has conformance tolerances is the result of a decoded bitstream, where we do want to have guarantees that the difference in decoded images between various implementations of decoders (including various versions and builds of libjxl) is very small.
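As a rough illustration of what comparing two decoder outputs involves, the sketch below computes the mean squared error and the peak absolute error between two decoded buffers. It assumes 8-bit samples, and the type and function names (`decode_diff`, `compare_decodes`, `within_tolerance`) are hypothetical; the limits are passed in by the caller because the actual thresholds come from the conformance specification and are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of how far apart two decoded images are. */
typedef struct {
    double mean_squared_error; /* RMSE is the square root of this value */
    int peak_abs_error;        /* largest per-sample absolute difference */
} decode_diff;

/* Compare two buffers of n 8-bit samples, sample by sample. */
decode_diff compare_decodes(const uint8_t *a, const uint8_t *b, size_t n) {
    decode_diff d = {0.0, 0};
    double sum_sq = 0.0;
    for (size_t i = 0; i < n; ++i) {
        int e = (int)a[i] - (int)b[i];
        if (e < 0) e = -e;
        if (e > d.peak_abs_error) d.peak_abs_error = e;
        sum_sq += (double)e * (double)e;
    }
    if (n > 0) d.mean_squared_error = sum_sq / (double)n;
    return d;
}

/* Returns 1 if both limits hold. Comparing against max_rmse squared is
   equivalent to comparing the RMSE itself and avoids a square root. */
int within_tolerance(decode_diff d, double max_rmse, int max_peak) {
    return d.mean_squared_error <= max_rmse * max_rmse
        && d.peak_abs_error <= max_peak;
}
```

Note that this kind of check applies to decoded pixels only. Two encoder builds can produce files of different sizes and both be perfectly valid.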

    In the encoder, some differences in bitstream size are expected, even when doing lossless encoding. The reason is that some encoder heuristics are implemented using floats, and a slightly different result can lead to a different choice, which can have quite a bit of impact on bitstream size. For example, if the encoder ends up using a slightly different context model, the bitstream size can change even when the actual image data is identical. In lossy mode there can also be slightly different choices of block sizes and types or of adaptive quantization weights, which can lead to differences in both image data and bitstream size.
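The way a tiny float difference turns into a different encoder decision can be sketched as follows. Nothing here is libjxl code: `choose_model`, the 0.5 threshold and the two summation functions are invented for illustration, with the reverse-order sum standing in for a compiler or SIMD path that accumulates in a different order.

```c
#include <stddef.h>

/* Hypothetical choice an encoder heuristic might make. */
typedef enum { CONTEXT_MODEL_A, CONTEXT_MODEL_B } model_choice;

/* Sum in index order, as a plain scalar loop would. */
float sum_forward(const float *v, size_t n) {
    volatile float acc = 0.0f; /* volatile: round every partial sum to float */
    for (size_t i = 0; i < n; ++i) acc = acc + v[i];
    return acc;
}

/* Sum in reverse order, standing in for a differently ordered accumulation. */
float sum_backward(const float *v, size_t n) {
    volatile float acc = 0.0f;
    for (size_t i = n; i > 0; --i) acc = acc + v[i - 1];
    return acc;
}

/* Hypothetical heuristic: pick model B when the score exceeds a threshold. */
model_choice choose_model(float score) {
    return score > 0.5f ? CONTEXT_MODEL_B : CONTEXT_MODEL_A;
}
```

For the input {1.0e8f, -1.0e8f, 1.0f}, the forward sum is 1 while the backward sum is 0, because 1 is smaller than the spacing between floats near 1e8 and gets absorbed. The heuristic therefore picks model B in one case and model A in the other, from identical input data. Once a choice like that differs, everything coded after it can differ too, which is how a rounding difference far below visibility ends up as a change of a few hundred bytes.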