Tags: gpgpu, vulkan, barrier, gpu-atomics

Vulkan subgroupBarrier does not synchronize invocations


I have a somewhat complex procedure that contains a nested loop and a subgroupBarrier. In simplified form it looks like this:

while(true){
   while(some_condition){
      if(end_condition){
          atomicAdd(some_variable,1);
          debugPrintfEXT("%d:%d",gl_SubgroupID,gl_SubgroupInvocationID.x);
          subgroupBarrier();
          if(gl_SubgroupInvocationID.x==0){
              debugPrintfEXT("Finish! %d", some_variable);
              // do some final stuff
          }
          return; // this is the only return in the entire procedure
      }
      // do some stuff
   }
   // do some stuff
}

Overall the procedure is correct and does what's expected of it. All subgroup threads always eventually reach the end condition. However, in my logs I see:

0:2
0:3
0:0
Finish! 3
0:1

And it's not just a matter of the logs being displayed out of order. I also perform an atomic addition, and its result seems to be wrong. I need all threads to finish their atomic operations before Finish! is printed. If subgroupBarrier() worked correctly, it should print 4, but in my case it prints 3. I've been mostly following this tutorial https://www.khronos.org/blog/vulkan-subgroup-tutorial, which says:

void subgroupBarrier() performs a full memory and execution barrier - basically when an invocation returns from subgroupBarrier() we are guaranteed that every invocation executed the barrier before any return, and all memory writes by those invocations are visible to all invocations in the subgroup.
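
For reference, in straight-line code where every invocation is active, the quoted guarantee is exactly what I'd expect to see. Here is a minimal sketch, assuming some_variable is a shared counter zeroed beforehand and a full subgroup of 4 invocations:

atomicAdd(some_variable, 1);   // every invocation is active here
subgroupBarrier();             // full memory and execution barrier for the subgroup
if(gl_SubgroupInvocationID.x==0){
    debugPrintfEXT("Finish! %d", some_variable); // reliably prints 4
}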

Interestingly, I tried changing if(gl_SubgroupInvocationID.x==0) to other invocation indices. For example, if(gl_SubgroupInvocationID.x==3) yields:

0:2
0:3
Finish! 2
0:0
0:1

So it seems like the subgroupBarrier() is entirely ignored.

Could the nested loop be the cause of the problem, or is it something else?

Edit:

Here is more detailed code:

#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_EXT_debug_printf : enable

layout (local_size_x_id = GROUP_SIZE_CONST_ID) in; // this is a specialization constant whose value always matches the subgroupSize

shared uint copied_faces_idx;

void main() {
    const uint chunk_offset = gl_WorkGroupID.x;
    const uint lID = gl_LocalInvocationID.x;
    // ... Some less important stuff happens here ...
    // (declarations of face_offset, i, and the buffer bindings are elided from this excerpt)
    const uint[2] ending = uint[2](relocated_leading_faces_ending, relocated_trailing_faces_ending);
    const uint[2] beginning = uint[2](offset_to_relocated_leading_faces, offset_to_relocated_trailing_faces);
    uint part = 0;
    face_offset = lID;
    Face face_to_relocate = faces[face_offset];
    i=-1;
    debugPrintfEXT("Stop 1: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
    subgroupBarrier(); // I added this just to see what happens
    debugPrintfEXT("Stop 2: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
    while(true){
        while(face_offset >= ending[part]){
            part++;
            if(part>=2){
                debugPrintfEXT("Stop 3: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
                subgroupBarrier();
                debugPrintfEXT("Stop 4: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
                for(uint i=lID;i<inserted_face_count;i+=GROUP_SIZE){ // note: this i shadows the outer i
                    uint offset = atomicAdd(copied_faces_idx,1);
                    face_to_relocate = faces_to_be_inserted[i];
                    debugPrintfEXT("Stop 5: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
                    tmp_faces_copy[offset+1] = face_to_relocate.x;
                    tmp_faces_copy[offset+2] = face_to_relocate.y;
                }
                subgroupBarrier(); // Let's make sure that copied_faces_idx has been incremented by all threads.
                if(lID==0){
                    debugPrintfEXT("Finish! %d",copied_faces_idx);
                    save_copied_face_count_to_buffer(copied_faces_idx);
                }
                return; 
            }
            face_offset = beginning[part] + lID;
            face_to_relocate = faces[face_offset];
        }
        i++;
        if(i==removed_face_count||shared_faces_to_be_removed[i]==face_to_relocate.x){
            remove_face(face_offset, i);
            debugPrintfEXT("remove_face: %d %d",gl_SubgroupID,gl_SubgroupInvocationID.x);
            face_offset+=GROUP_SIZE;
            face_to_relocate = faces[face_offset];
            i=-1;
        }
    }
}

Basically, what this code does is equivalent to:

outer1:for(every face X in polygon beginning){
   for(every face Y to be removed from polygons){
      if(X==Y){
         remove_face(X);
         continue outer1;
      }
   } 
}
outer2:for(every face X in polygon ending){
   for(every face Y to be removed from polygons){
      if(X==Y){
         remove_face(X);
         continue outer2;
      }
   } 
}
for(every face Z to be inserted in the middle of polygon){
   insertFace(Z);
}
save_copied_face_count_to_buffer(number_of_faces_copied_along_the_way);

The reason my code looks so convoluted is that I wrote it to be more parallelizable and to minimize the number of inactive threads (since threads in the same subgroup usually have to execute the same instruction).

I also added a bunch more debug prints and one more barrier just to see what happens. Here are the logs I got:

Stop 1: 0 0
Stop 1: 0 1
Stop 1: 0 2
Stop 1: 0 3
Stop 2: 0 0
Stop 2: 0 1
Stop 2: 0 2
Stop 2: 0 3
Stop 3: 0 2
Stop 3: 0 3
Stop 4: 0 2
Stop 4: 0 3
Stop 5: 0 2
Stop 5: 0 3
remove_face: 0 0
Stop 3: 0 0
Stop 4: 0 0
Stop 5: 0 0
Finish! 3   // at this point value 3 is saved (which is the wrong value)
remove_face: 0 1
Stop 3: 0 1
Stop 4: 0 1
Stop 5: 0 1 // at this point atomic is incremented and becomes 4 (which is the correct value)

Solution

  • I found the reason why my code did not work. It turns out that I misunderstood how exactly subgroupBarrier() decides which threads to synchronize. If a thread is inactive, it will not participate in the barrier. It doesn't matter that the inactive thread will later become active and eventually reach the barrier.

    Those two loops are not equivalent, even though they seem to be:

    while(true){
       if(end_condition){
          break;
       }
    }
    subgroupBarrier();
    some_function();
    

    and

    while(true){
       if(end_condition){
          subgroupBarrier();
          some_function();
          return;
       }
    }
    
    

    If all threads reach the end condition in the exact same iteration, then there is no problem, because all threads are active at the same time.

    The issue appears when different threads can exit the loop at different iterations. If thread A passes the end condition after 2 iterations and thread B passes it after 3 iterations, then there will be one entire iteration during which A is inactive, waiting for B to finish.

    In the first scenario, A reaches the break first, then B reaches the break second, and finally both threads exit the loop together and arrive at the barrier in converged control flow.

    In the second scenario, A reaches the end condition first and executes the if statement, while B is inactive, waiting for A to finish. As A reaches the barrier, it is the only active thread at that point in time, so it passes through the barrier without synchronizing with B. A then finishes executing the body of the if statement, reaches the return, and becomes inactive. Only then does B become active again and finish executing its iteration. In the next iteration B reaches the end condition and the barrier, and again it is the only active thread, so the barrier has nothing to synchronize.
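
    To make the fix concrete, here is the simplified procedure from the top of the question restructured along the lines of the first (working) form: each invocation records the exit in a flag and breaks out, so the barrier executes in converged control flow. This is a sketch built from the question's placeholders (some_condition, end_condition, some_variable), not a tested drop-in:

    bool done = false;
    while(!done){
       while(some_condition){
          if(end_condition){
              atomicAdd(some_variable,1);
              debugPrintfEXT("%d:%d",gl_SubgroupID,gl_SubgroupInvocationID.x);
              done = true;  // remember the exit instead of returning here
              break;        // leave the inner loop
          }
          // do some stuff
       }
       if(done){
           break;           // leave the outer loop, skipping its tail work
       }
       // do some stuff
    }
    // All invocations eventually fall through to this point, so every one of
    // them is active when the barrier executes and all the atomic additions
    // are visible afterwards.
    subgroupBarrier();
    if(gl_SubgroupInvocationID.x==0){
        debugPrintfEXT("Finish! %d", some_variable); // now prints 4
        // do some final stuff
    }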