The meaning and implications of VK_DEPENDENCY_BY_REGION_BIT

An input attachment is read with the GLSL function subpassLoad, which fetches the attachment's value at the current fragment position; the interface provides no random access. As a consequence, input attachments cannot be read at arbitrary fragment locations.

This practically means [1]:

If a rendering technique requires reading values outside the current fragment area (which on a tiler would mean accessing rendered data outside the currently-rendering tile), separate render passes must be used.

Then, about VK_DEPENDENCY_BY_REGION_BIT the specification says [2]:

If a synchronization command includes a dependencyFlags parameter, and specifies the VK_DEPENDENCY_BY_REGION_BIT flag, then it defines framebuffer-local dependencies for the framebuffer-space pipeline stages in that synchronization command, for all framebuffer regions. If no dependencyFlags parameter is included, or the VK_DEPENDENCY_BY_REGION_BIT flag is not specified, then a framebuffer-global dependency is specified for those stages.

Hans-Kristian Arntzen from ARM [3] suggests that on tiled architectures multi-subpass render passes should be used only in conjunction with VK_DEPENDENCY_BY_REGION_BIT:

Next, we try to merge adjacent render passes together. This is particularly important on tile-based renderers. We try to merge passes together if:

They are both graphics passes

They share some color/depth/input attachments

Not more than one unique depth/stencil attachment exists

Their dependencies can be implemented with BY_REGION_BIT, i.e. no “texture” dependency, which allows sampling for arbitrary locations.
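The four merge criteria quoted above can be read as a single predicate over two adjacent passes. The following is only a minimal sketch of that reading: PassInfo, its fields, and can_merge are hypothetical names invented for illustration and are not taken from the cited render graph or from the Vulkan API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pass description, invented for this sketch. */
enum { MAX_ATTACHMENTS = 8, NO_ATTACHMENT = -1 };

typedef struct PassInfo {
    bool   is_graphics;                  /* graphics pass (as opposed to compute) */
    int    attachments[MAX_ATTACHMENTS]; /* ids of every color/depth/input attachment used */
    size_t attachment_count;
    int    depth_stencil;                /* depth/stencil attachment id, or NO_ATTACHMENT */
    bool   reads_prev_as_texture;        /* samples the previous pass's output at arbitrary locations */
} PassInfo;

/* True if the two passes have at least one attachment id in common. */
static bool shares_attachment(const PassInfo *a, const PassInfo *b)
{
    for (size_t i = 0; i < a->attachment_count; ++i)
        for (size_t j = 0; j < b->attachment_count; ++j)
            if (a->attachments[i] == b->attachments[j])
                return true;
    return false;
}

/* Applies the four quoted criteria to two adjacent passes. */
bool can_merge(const PassInfo *prev, const PassInfo *next)
{
    assert(prev != NULL && next != NULL);

    /* 1. Both must be graphics passes. */
    if (!prev->is_graphics || !next->is_graphics)
        return false;

    /* 2. They must share some color/depth/input attachment. */
    if (!shares_attachment(prev, next))
        return false;

    /* 3. At most one unique depth/stencil attachment across both passes. */
    if (prev->depth_stencil != NO_ATTACHMENT &&
        next->depth_stencil != NO_ATTACHMENT &&
        prev->depth_stencil != next->depth_stencil)
        return false;

    /* 4. The dependency must be expressible with BY_REGION_BIT,
     *    i.e. no "texture" dependency that samples arbitrary locations. */
    if (next->reads_prev_as_texture)
        return false;

    return true;
}
```

Under this reading, a G-buffer pass followed by a lighting pass that only reads the G-buffer at the current pixel would merge, while a blur pass that samples neighbouring pixels of the previous output would not.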

Now the questions are:

If you cannot access fragments outside of the current fragment location anyway, what is the point of VK_DEPENDENCY_BY_REGION_BIT?
On tiled architectures does a multi-subpass render pass where subpass dependencies cannot be declared with VK_DEPENDENCY_BY_REGION_BIT provide any performance advantage over functionally equivalent properly-synchronized series of separate single-subpass render passes?

Solution

Well, the specification gives one example: if you want to access a sample of the input attachment that is not covered by the fragment, you have to use a framebuffer-global dependency (i.e. dependencyFlags = 0), unless one of the vendor extensions addresses that.

The most obvious example, though, is non-attachment resources, which are naturally random access (you can read or write any location). With VK_DEPENDENCY_BY_REGION_BIT, only the data written for the same framebuffer region is guaranteed to be visible. With a framebuffer-global dependency (dependencyFlags=0), you can access a location in a storage buffer written by any fragment shader invocation of the previous subpass.
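The difference in guarantees can be written down as a toy model. This is a conceptual sketch only, not Vulkan API code: Region, write_guaranteed_visible, and MODEL_DEPENDENCY_BY_REGION_BIT are made-up names (the constant merely mirrors the numeric value of VK_DEPENDENCY_BY_REGION_BIT), and the model assumes the dependency's stage and access masks already cover both accesses. It treats a framebuffer region as a single (x, y, layer, sample) coordinate, which is the smallest granularity an application can rely on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the numeric value of VK_DEPENDENCY_BY_REGION_BIT; the name is hypothetical. */
#define MODEL_DEPENDENCY_BY_REGION_BIT 0x00000001u

/* Hypothetical: one framebuffer region at the finest granularity. */
typedef struct Region {
    uint32_t x, y, layer, sample;
} Region;

static bool same_region(Region a, Region b)
{
    return a.x == b.x && a.y == b.y && a.layer == b.layer && a.sample == b.sample;
}

/* Is a write made by the fragment at `writer` in the source subpass
 * guaranteed to be visible to the fragment at `reader` in the
 * destination subpass? Models the guarantee, not hardware behaviour. */
bool write_guaranteed_visible(uint32_t dependency_flags, Region writer, Region reader)
{
    if (dependency_flags & MODEL_DEPENDENCY_BY_REGION_BIT) {
        /* Framebuffer-local: only the same region is ordered. */
        return same_region(writer, reader);
    }
    /* Framebuffer-global: every region of the source subpass is ordered. */
    return true;
}

/* A few checks that spell out the two cases. */
void visibility_examples(void)
{
    Region here  = { 10, 20, 0, 0 };
    Region there = { 11, 20, 0, 0 };

    assert(write_guaranteed_visible(MODEL_DEPENDENCY_BY_REGION_BIT, here, here));
    assert(!write_guaranteed_visible(MODEL_DEPENDENCY_BY_REGION_BIT, here, there));
    assert(write_guaranteed_visible(0u, here, there));
}
```

A read that the model reports as not guaranteed may still happen to see the data on a given implementation; the point is only that the application cannot rely on it.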

dependencyFlags=0 is sort of a soft restart of the render pass. So, all else being equal, I would rank the performance this way:
single subpass ≥ multiple subpasses with VK_DEPENDENCY_BY_REGION_BIT ≥ multiple subpasses without VK_DEPENDENCY_BY_REGION_BIT ≥ multiple render passes.

Whether framebuffer-global subpasses actually provide any performance advantage I cannot say without measuring a particular implementation (and such a result would be perishable, potentially changing with new GPUs or even driver versions). The case should not be worse than separate render passes, though, since splitting into separate render passes is likely the worst demotion the driver itself would perform if it cannot do anything better with those subpasses.