Shader compiler on Alderlake GT1: SIMD32 shader inefficient

When I compile and link my GLSL shader on an Alderlake GT1 integrated GPU, I get the warning:

SIMD32 shader inefficient

This warning is reported via the glDebugMessageCallbackARB mechanism.

I would like to investigate whether I can avoid this inefficiency, but I am not sure how to get more information about this warning.

The full output from the driver, for this shader:

WRN [Shader Compiler][Other]{Notification}: VS SIMD8 shader: 11 inst, 0 loops, 40 cycles, 0:0 spills:fills, 1 sends, scheduled with mode top-down, Promoted 0 constants, compacted 176 to 112 bytes.

WRN [API][Performance]{Notification}: SIMD32 shader inefficient

WRN [Shader Compiler][Other]{Notification}: FS SIMD8 shader: 5 inst, 0 loops, 20 cycles, 0:0 spills:fills, 1 sends, scheduled with mode top-down, Promoted 0 constants, compacted 80 to 48 bytes.

WRN [Shader Compiler][Other]{Notification}: FS SIMD16 shader: 5 inst, 0 loops, 28 cycles, 0:0 spills:fills, 1 sends, scheduled with mode top-down, Promoted 0 constants, compacted 80 to 48 bytes.

By the way, the messages are emitted while the fragment shader is being compiled.

My vertex shader:

#version 150
in mediump vec2 position;
out lowp vec4 clr;
uniform mediump vec2 rotx;
uniform mediump vec2 roty;
uniform mediump vec2 translation;
uniform lowp vec4 colour;
void main()
{
    gl_Position.x = dot( position, rotx ) + translation.x;
    gl_Position.y = dot( position, roty ) + translation.y;
    gl_Position.z = 1.0;
    gl_Position.w = 1.0;
    clr = colour;
}

My fragment shader:

#version 150
in  lowp vec4 clr;
out lowp vec4 fragColor;
void main()
{
    fragColor = clr;
}

That said, I doubt it is shader-specific, because the driver seems to report this for every shader I use on this platform.

GL RENDERER: Mesa Intel(R) Graphics (ADL-S GT1)

OS: Ubuntu 22.04

GPU: AlderLake-S GT1

API: OpenGL 3.2 Core Profile

GLSL Version: 150

Solution

This seems to come from the Intel fragment shader compiler that is part of Mesa.

The relevant source file is brw_fs.cpp.

Looking at this code, the compiler has three options: SIMD8, SIMD16, or SIMD32. The number refers to the SIMD width (how many channels are processed at once), not to a bit count. So SIMD32 is 32-wide SIMD.

The compiler uses a heuristic to see if the SIMD32 version will be efficient, and if not, it skips that option.

Of course, this heuristic can get it wrong, so there is an option to force the BRW compiler to try SIMD32 regardless.

Setting the environment variable INTEL_DEBUG=do32 tells the compiler to try SIMD32 as well.
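If launching the program with INTEL_DEBUG=do32 in its environment is inconvenient, the variable can also be set from inside the application. The following is only a sketch: request_simd32 is a hypothetical helper name, and it assumes the driver reads INTEL_DEBUG when it initialises, so the call would have to happen before the GL context is created.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: ask the Intel Mesa driver to also try SIMD32.
 * Assumption: the driver reads INTEL_DEBUG during its own initialisation,
 * so call this before creating the GL context.
 * setenv() is POSIX, which matches the Ubuntu system in the question. */
static int request_simd32(void)
{
    /* The third argument (1) overwrites any existing INTEL_DEBUG value. */
    if (setenv("INTEL_DEBUG", "do32", 1) != 0) {
        perror("setenv");
        return -1;
    }
    return 0;
}
```

Note that this replaces any INTEL_DEBUG value that was already set, so other debug flags passed on the command line would be lost.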

When I tested this on my system, I indeed observed that the driver now reports three different results:

WRN [Shader Compiler][Other]{Notification}: FS SIMD8 shader: 5 inst, 0 loops, 20 cycles, 0:0 spills:fills, 1 sends, scheduled with mode top-down, Promoted 0 constants, compacted 80 to 48 bytes.

WRN [Shader Compiler][Other]{Notification}: FS SIMD16 shader: 5 inst, 0 loops, 28 cycles, 0:0 spills:fills, 1 sends, scheduled with mode top-down, Promoted 0 constants, compacted 80 to 48 bytes.

WRN [Shader Compiler][Other]{Notification}: FS SIMD32 shader: 10 inst, 0 loops, 928 cycles, 0:0 spills:fills, 2 sends, scheduled with mode top-down, Promoted 0 constants, compacted 160 to 96 bytes.

Observe that in this case the heuristic definitely got it right: the SIMD32 variant is estimated at 928 cycles, roughly 46 times as many as the 20 cycles of SIMD8.
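To compare the variants without reading the log by eye, the width and cycle estimate can be pulled out of each message in the debug callback. This is a minimal sketch that assumes the message text keeps the exact layout shown above; parse_fs_stats is a hypothetical helper, not part of Mesa or OpenGL.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the SIMD width and the cycle estimate from
 * one fragment shader statistics message, e.g.
 *   "FS SIMD32 shader: 10 inst, 0 loops, 928 cycles, ..."
 * Returns 1 on success, 0 if the message does not have that layout. */
static int parse_fs_stats(const char *msg, int *width, int *cycles)
{
    int inst, loops;
    /* Skip any log prefix and find the start of the statistics text. */
    const char *p = strstr(msg, "FS SIMD");

    if (p == NULL)
        return 0;

    /* All four numbers must be converted for the parse to count. */
    return sscanf(p, "FS SIMD%d shader: %d inst, %d loops, %d cycles",
                  width, &inst, &loops, cycles) == 4;
}
```

Feeding it the three messages above yields 20, 28 and 928 cycles for widths 8, 16 and 32, and 928 / 20 = 46.4.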

Fun fact: BRW stands for Broadwater, the gen4 graphics architecture, yet gen12 Intel GPUs still use this compiler.