I'm trying to understand how the entire L1/L2 flushing works. Suppose I have a compute shader like this one
layout(std430, set = 0, binding = 2) buffer Particles{
    Particle particles[];
};
layout(std430, set = 0, binding = 4) buffer Constraints{
    Constraint constraints[];
};
void main(){
    const uint gID = gl_GlobalInvocationID.x;
    for (int pass=0;pass<GAUSS_SEIDEL_PASSES;pass++){
        // first query the constraint, which contains particle_id_1 and particle_id_2
        const Constraint c = constraints[gID*GAUSS_SEIDEL_PASSES+pass];
        // read newest positions
        vec3 position1 = particles[c.particle_id_1].position;
        vec3 position2 = particles[c.particle_id_2].position;
        // modify position1 and position2
        position1 += something;
        position2 -= something;
        // update positions
        particles[c.particle_id_1].position = position1;
        particles[c.particle_id_2].position = position2;
        // in the next iteration, different constraints may use the updated positions
    }
}
From what I understand, initially all data resides in L2. When I read particles[c.particle_id_1].position, I copy some of the data from L2 to L1 (or directly to a register). Then in position1 += something I modify L1 (or the register). Finally, in particles[c.particle_id_1].position = position1, I flush the data from L1 (or a register) back to L2, right? So if I then have a second compute shader that I want to run after this one, and that second shader will read the positions of particles, I do not need to synchronize Particles. It would be enough to just put an execution barrier, without a memory barrier:
void vkCmdPipelineBarrier(
VkCommandBuffer commandBuffer,
VkPipelineStageFlags srcStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
VkPipelineStageFlags dstStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
VkDependencyFlags dependencyFlags, // here nothing
uint32_t memoryBarrierCount, // here 0
const VkMemoryBarrier* pMemoryBarriers, // nullptr
uint32_t bufferMemoryBarrierCount, // 0
const VkBufferMemoryBarrier* pBufferMemoryBarriers, // nullptr
uint32_t imageMemoryBarrierCount, // 0
const VkImageMemoryBarrier* pImageMemoryBarriers); // nullptr
Vulkan's memory model does not care about "caches" as caches. Its model is built on the notions of availability and visibility. An execution dependency between GPU command/stage A and command/stage B only guarantees ordering: the work in A finishes before the work in B starts. It says nothing about the values A wrote. A memory dependency is an execution dependency that also includes an availability operation and a visibility operation. The availability operation makes the writes performed by A, through the access types named in the source access mask, "available". The visibility operation then makes those available writes "visible" to B, for the access types named in the destination access mask.
If a value is not both available and visible to a command/stage, then attempting to access it yields undefined behavior.
The implementation of availability and visibility will typically involve flushing and invalidating caches and the like. But as far as the Vulkan memory model is concerned, this is an implementation detail it doesn't care about. Nor should you: understand the Vulkan memory model and write code that works within it.
Your pipeline barrier creates an execution dependency, but not a memory dependency. Therefore, the second dispatch is guaranteed to start only after the first has finished, but the values written by CS invocations before the barrier are not guaranteed to be available or visible to CS invocations after it. You need a memory dependency: pass a memory barrier (or a buffer memory barrier covering the Particles buffer) whose source access mask covers the shader writes and whose destination access mask covers the shader reads.
However, if you want a GPU-level understanding... it all depends on the GPU. Does the GPU have a cache hierarchy, an L1/L2 split? Some do; others may not.
It's kind of irrelevant anyway, because merely writing a value to an address in memory is not equivalent to a "flush" of the appropriate caches around that memory. Even using the coherent qualifier would only cause a flush for compute shader operations executing within that same dispatch call. It would not be guaranteed to affect later dispatch calls.