parallel-processing, hpc, tiling

How can I generalize Diamond Tiling to higher dimensions?


Background

This is a real programming problem I've encountered in my attempt to optimize a PDE solver. Before I can ask the question in an understandable way, I have to first introduce some background and the history of the research in this field.

Stencil Computation

In scientific and engineering computing, simulating physical systems governed by Partial Differential Equations (PDEs) is a common task. These numerical PDE solvers usually calculate the state of a system from its current timestep t to its next timestep t + 1 according to some update equations. The future state t + 1 at position x (state(x, t + 1)) often depends on the current state of x's nearest neighbors. For example, when using the Jacobi method to solve the 1D heat equation, the dependency chain is state(x, t + 1) <- state(x - 1, t), state(x, t), state(x + 1, t). Because the data dependencies always have a fixed shape, it's known to HPC programmers as a stencil, and this type of computation is named iterative stencil computation.

The following are some examples of stencils: (1) A Jacobi solver of 2D heat equation has the dependency chain of itself and its top, bottom, left, and right, similar to the D-Pad on a game controller (von Neumann neighborhood). (2) An implementation of Conway's Game of Life has the dependency of north, south, east, west, northeast, northwest, southeast and southwest (Moore neighborhood). (3) A 2D convolution kernel in image processing.

The problem I'm working on is Computational Electromagnetism (CEM) using the Finite-Difference Time-Domain (FDTD) method to solve Maxwell's equations in 3D space. Its stencil shape is more complex (because of the leapfrog update between the electric and magnetic fields), but it's otherwise rather similar to the 7-point 3D stencil - which itself is a generalization of the 3-point 1D stencil. Thus, for clarity, the rest of this question only discusses the 3-point 1D stencil (and its generalizations in higher dimensions), as shown below. We also assume all elements outside the boundary are 0 (Dirichlet boundary condition).

  (t)       x_1
            /|\
           / | \
          /  |  \
(t - 1) x_0 x_1 x_2 x_3 x_4 x_5 x_6 ...

A naive implementation looks like something below:

for (size_t t = 0; t < T_MAX; t++) {
    for (size_t x = 0; x < X_MAX; x++) {
        float prev   = x > 0 ? now[x - 1] : 0;
        float center = now[x];
        float next   = x < X_MAX - 1 ? now[x + 1] : 0;

        future[x] = f(prev, center, next);
    }
    swap(&future, &now);  /* swap front and back buffer */
}

Memory Wall

It's a well-known fact that numerical simulations based on stencil computations have an extreme memory bandwidth bottleneck, especially in a naive implementation, which creates a tremendous amount of DRAM traffic. Consider a 1 GB simulation domain with 10000 timesteps on a desktop computer with 40 GB/s memory bandwidth (dual-channel DDR4-3200). Because the simulation domain is too large to fit in cache, each timestep involves a full memory scan. The arithmetic intensity is extremely low, and the roofline model shows that performance lands far below the CPU's peak FLOPS.
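A back-of-envelope version of that roofline argument (the factor of 2 is my assumption that each timestep reads `now` and writes `future` exactly once, ignoring all caching):

```c
/* Roofline-style estimate: runtime of a purely bandwidth-bound solver.
   Assumes each timestep streams the whole domain twice (read + write). */
static double streaming_runtime_s(double domain_bytes, double timesteps,
                                  double bandwidth_bytes_per_s) {
    double traffic = 2.0 * domain_bytes * timesteps;
    return traffic / bandwidth_bytes_per_s;
}
```

For the example above - a 1 GB domain, 10^4 timesteps, 40 GB/s - this estimates 500 seconds spent purely on DRAM streaming, no matter how fast the CPU computes.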

We can visualize the iteration space of the loop using a 2D spacetime diagram (1D space + 1D time), each color represents a unit of work:

2D spacetime diagram of a naive loop

In ordinary programming, loop tiling (loop blocking) is the standard solution to this problem: the array is broken into multiple rectangular blocks and calculated tile by tile. However, in the field of stencil computation there are several unique challenges. First, conventional loop tiling is spatial (space) tiling, but a physics simulation often applies just a single operation to each tile, not multiple operations, so there is little to no data reuse even if the loop is blocked. Instead, we have to use temporal (time) tiling to calculate multiple timesteps at once. Next, stencil computation has data dependencies that extend beyond the tile itself to its neighboring tiles, so preserving the correct data dependencies is now the problem. In particular, time tiling means that the simulation domain becomes asynchronous: every element can be at a different timestep - without an appropriate tile shape that brings natural synchronization, this would be a nightmare to handle.

Parallelogram Tiling

Historically, parallelogram tiling was the first solution to this problem. First, the space is broken into small rectangular tiles. Then, at each timestep, the range of updated elements shrinks by one element on the right, because the rightmost element cannot be updated due to missing dependencies. On the other hand, the range also grows by one element on the left. By using a sliding window of two tiles, updating the next tile automatically "fixes" the elements with the wrong timesteps on the left.

This tiling algorithm is visualized in the following spacetime diagram:

parallelogram tiling

One can see the following desirable properties of this tiling:

  1. There's no dependency violation anywhere, as long as the tiles are iterated from left to right (note that we're buffering the system's state at two timesteps in now and future - see the example code above - so accessing one timestep below the updated element is safe).
  2. Updating the right tile automatically "fixes" the elements with the wrong timesteps on the left.
  3. At the end of the process, all tiles are naturally re-aligned to the same timestep, even though the simulation domain becomes asynchronous during the process itself.
  4. Using a sliding window buffer that contains two tiles, there are no redundant memory accesses, even though the two tiles overlap.
  5. The width of the tile should be half of the cache size. The speedup grows linearly with cache size (up to the cache bandwidth bottleneck): a larger cache allows one to calculate more timesteps at once.
  6. It can be readily generalized to higher dimensions. For example, in 3D, we first use a triple nested loop to select a tile, then within the tile we use a quadruple nested loop to grow and shrink the x, y, and z ranges per timestep.
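A minimal serial sketch of the 1D version, under assumptions of my own (the tile width W, time block BT, the Jacobi-like kernel f, and the parity trick - a cell's value at timestep T lives in buf[T & 1], so a neighbor may legally be one timestep ahead while its older value survives in the other slot):

```c
/* 1D parallelogram tiling, W cells wide, BT timesteps per sweep.
   Illustrative sketch: X_MAX, W, BT and the kernel f are made up. */
enum { X_MAX = 100, W = 16, BT = 8 };

static float f(float prev, float center, float next) {
    return 0.25f * prev + 0.5f * center + 0.25f * next;  /* Jacobi-like */
}

/* advance every cell from time t0 to t0 + BT in one left-to-right pass */
static void parallelogram_sweep(float buf[2][X_MAX], int t0) {
    /* tile xb covers the slanted row [xb - t, xb + W - t) at inner
       step t, clipped to the domain; the slant preserves dependencies */
    for (int xb = 0; xb < X_MAX + BT; xb += W) {
        for (int t = 0; t < BT; t++) {
            int par = (t0 + t) & 1;             /* parity of "time now" */
            int lo = xb - t, hi = xb + W - t;
            if (lo < 0) lo = 0;
            if (hi > X_MAX) hi = X_MAX;
            for (int x = lo; x < hi; x++) {
                float prev = x > 0         ? buf[par][x - 1] : 0.0f;
                float next = x < X_MAX - 1 ? buf[par][x + 1] : 0.0f;
                buf[par ^ 1][x] = f(prev, buf[par][x], next);
            }
        }
    }
}
```

Calling parallelogram_sweep(buf, 0) and then parallelogram_sweep(buf, BT) advances the whole domain 2*BT timesteps, touching memory only once per sweep instead of once per timestep.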

Fatal Flaw: No Parallelism.

However, this tiling algorithm also has a fatal flaw: the lack of inter-tile parallelism. To preserve dependencies, all tiles must be iterated from left to right. There exist three workarounds:

  1. Parallelize within a single tile, processing the tiles themselves serially one by one - the cost is usually far higher than the parallel speedup.

  2. Process all tiles in parallel, but insert a barrier at the end of each timestep - the cost is usually far higher than the parallel speedup, and it cannot be applied to GPUs due to the lack of an inter-workgroup barrier.

  3. Hyperplane or wavefront parallelism: As a matter of fact, updating a single region in the simulation doesn't instantaneously affect other parts. Instead, the propagation of dependencies follows a particular path in spacetime. In 1D space, it's not particularly interesting - changing a region on the left causes a corresponding change in the region on its right, one tile at a time. But in 2D space and above, a single region may affect multiple regions, thus creating parallelism. Researchers used this insight to create another form of parallel algorithm, called hyperplane or wavefront parallelism. It's called wavefront parallelism because the updated regions follow how a wave spreads out in space. It's called hyperplane parallelism because the parallel processing follows the propagation of causality (thus information) within the simulation box on a hyperplane (in 2D, a line; in 3D, a plane), just like the propagation of lightcones in special relativity's Minkowski spacetime - except that our lightcone eventually shrinks due to the finiteness of the simulation.

    lightcone

    In 2D space, we can first break the simulation domain into small rectangular supertiles. Each supertile is processed serially using parallelogram tiling internally, but different supertiles can sometimes be processed in parallel, based on their nearest-neighbor dependency chains.

    hyperplane parallelism

    For example, in the 1st pass, the only independent supertile is (0, 0) - it's located at the top-left edge of the universe with all-zero boundary conditions, so it does not depend on anything else. In the 2nd pass, we can see that the supertile (0, 0) only has two neighbors: (1, 0) and (0, 1). Thus, only these two supertiles have a cause-and-effect relationship with (0, 0), and they're also independent of each other. In the 3rd pass, (2, 0), (1, 1), (0, 2) are independent, etc. The same technique readily generalizes to 3D space. However, hyperplane parallelism has the problem of irregularity - the number of independent tiles first grows, reaches a maximum, then shrinks. This phenomenon is known as pipelined startup - the hypothetical "pipeline" is filled and then drained. It creates significant synchronization overhead and load imbalance problems.
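    The pass structure described above can be sketched as follows (process_supertile is a hypothetical stand-in for "run parallelogram tiling inside supertile (i, j)"; the grid size is made up):

```c
enum { NX = 4, NY = 5 };      /* supertile grid, sizes illustrative */
static int pass_of[NX][NY];   /* records the pass each supertile ran in */

static void process_supertile(int i, int j, int pass) {
    pass_of[i][j] = pass;     /* a real solver would time-tile here */
}

/* hyperplane/wavefront schedule: pass p runs the anti-diagonal i + j == p */
static void wavefront(void) {
    for (int p = 0; p < NX + NY - 1; p++) {
        int lo = p - NY + 1 > 0 ? p - NY + 1 : 0;
        int hi = p < NX - 1 ? p : NX - 1;
        /* every supertile in this inner loop is independent of the
           others, e.g. #pragma omp parallel for */
        for (int i = lo; i <= hi; i++)
            process_supertile(i, p - i, p);
    }
}
```

    Note how the anti-diagonal length grows and then shrinks with p - that is exactly the pipelined startup/drain problem mentioned above.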

Trapezoid Tiling

To overcome the lack of parallelism in parallelogram tiling, an alternative tiling algorithm was invented by researchers, known as trapezoid tiling (or trapezoidal tiling). The innovation here is that, instead of using a single tile shape, it uses split-tiling with two different tile shapes: an inverted trapezoid that shrinks over time, known as a "mountain", and a regular trapezoid that grows over time, known as a "valley".

The 1D case is visualized in the following 2D spacetime diagram:

trapezoid tiling

As one can see, different mountains are completely independent and can be processed in parallel, because the trapezoid tiles provide a natural spacing that avoids the dependencies. After all mountains are processed, all valleys are then processed in parallel (in the picture, valleys at the boundary can either be treated as incomplete valleys, or as part of the first two mountains). In the 1D case, one only needs two synchronizations for many timesteps, providing great parallelization on massively parallel computers.
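A sketch of the two-phase schedule (all mountains, then all valleys), with the same hedges as before - W, BT, the kernel and the parity buffering (timestep T lives in buf[T & 1]) are my own illustrative choices:

```c
/* 1D trapezoid (split) tiling: phase 1 processes all "mountains",
   phase 2 all "valleys". Requires X_MAX % W == 0 and 2 * BT <= W. */
enum { X_MAX = 32, W = 16, BT = 8 };

static float f(float prev, float center, float next) {
    return 0.25f * prev + 0.5f * center + 0.25f * next;
}

/* advance cells [lo, hi) (clipped to the domain) from time t to t + 1 */
static void row(float buf[2][X_MAX], int t, int lo, int hi) {
    int par = t & 1;
    if (lo < 0) lo = 0;
    if (hi > X_MAX) hi = X_MAX;
    for (int x = lo; x < hi; x++) {
        float prev = x > 0         ? buf[par][x - 1] : 0.0f;
        float next = x < X_MAX - 1 ? buf[par][x + 1] : 0.0f;
        buf[par ^ 1][x] = f(prev, buf[par][x], next);
    }
}

/* advance the whole domain by BT steps with only two synchronizations */
static void trapezoid_step(float buf[2][X_MAX]) {
    /* phase 1: shrinking mountains - all independent, parallelizable */
    for (int xb = 0; xb < X_MAX; xb += W)
        for (int t = 0; t < BT; t++)
            row(buf, t, xb + t, xb + W - t);
    /* phase 2: growing valleys (incomplete ones at the boundaries) */
    for (int c = 0; c <= X_MAX; c += W)
        for (int t = 1; t < BT; t++)
            row(buf, t, c - t, c + t);
}
```

Because BT is even here, the parity stays aligned across repeated calls.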

Generalization to Higher Dimensions

Trapezoid tiling can also be generalized to higher dimensions naturally. Each dimension can be considered as 1D separately, and the 1D tiles from each dimension are combined to form a 2D plane (with a 3D spacetime) or a 3D cube (with a 4D spacetime). Reasoning about it geometrically can be extremely difficult or impossible, but fortunately it's easy to reason about algebraically: the combinations follow the Cartesian product of the tile types from each dimension. In other words, 2D needs 4 stages of processing, and 3D needs 8 stages of processing.

Here's how to generalize trapezoid tiling to 2D space (3D spacetime):

  1. Combine all X mountains with Y mountains, process all in parallel

    (mountain, mountain)

    Its 3D spacetime diagram is:

    (mountain, mountain) in 3D spacetime
  2. Combine all X mountains with Y valleys, process all in parallel

    (mountain, valley)

    Its 3D spacetime diagram is:

    (mountain, valley) in 3D spacetime
  3. Combine all X valleys with Y mountains, process all in parallel

    (valley, mountain)

    Its 3D spacetime diagram is:

    (valley, mountain) in 3D spacetime
  4. Combine all X valleys with Y valleys, process all in parallel

    (valley, valley)

    Its 3D spacetime diagram is:

    (valley, valley) in 3D spacetime

Note that there's only a single time axis shared by all dimensions, and the timesteps of different tiles are always aligned. For example, in 3D space, the ranges of x, y and z grow or shrink simultaneously at each timestep, allowing a clean generalization to higher-dimensional space.
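The four stages above can be sketched in C. Everything here is an illustrative assumption of mine (grid size N, tile width W, time block BT, the 5-point kernel f5, and the parity buffering where timestep T lives in buf[T & 1], so a neighbor may legally be one timestep ahead):

```c
/* 2D split tiling as the Cartesian product {mountain, valley}^2.
   Requires N % W == 0 and 2 * BT <= W; BT even keeps parity aligned. */
enum { N = 32, W = 16, BT = 4 };

static float f5(float c, float w, float e, float s, float n) {
    return 0.5f * c + 0.125f * (w + e + s + n);
}

static float at(float buf[2][N][N], int p, int x, int y) {
    return (x < 0 || x >= N || y < 0 || y >= N) ? 0.0f : buf[p][x][y];
}

/* advance the (clipped) rectangle from time t to t + 1 */
static void rect(float buf[2][N][N], int t, int x0, int x1, int y0, int y1) {
    int p = t & 1;
    if (x0 < 0) x0 = 0;
    if (x1 > N) x1 = N;
    if (y0 < 0) y0 = 0;
    if (y1 > N) y1 = N;
    for (int x = x0; x < x1; x++)
        for (int y = y0; y < y1; y++)
            buf[p ^ 1][x][y] = f5(buf[p][x][y],
                                  at(buf, p, x - 1, y), at(buf, p, x + 1, y),
                                  at(buf, p, x, y - 1), at(buf, p, x, y + 1));
}

/* per-timestep 1D range of a tile: type 0 = mountain (shrinks),
   type 1 = valley (grows; empty at t == 0) */
static void range1d(int type, int base, int t, int *lo, int *hi) {
    if (type == 0) { *lo = base + t; *hi = base + W - t; }
    else           { *lo = base - t; *hi = base + t;     }
}

/* advance the whole N x N domain by BT steps in 4 phases */
static void trapezoid_2d(float buf[2][N][N]) {
    for (int ph = 0; ph < 4; ph++) {       /* MM, MV, VM, VV */
        int tx = ph >> 1, ty = ph & 1;
        int xend = tx ? N : N - W, yend = ty ? N : N - W;
        for (int xb = 0; xb <= xend; xb += W)     /* tiles within one  */
            for (int yb = 0; yb <= yend; yb += W) /* phase independent */
                for (int t = 0; t < BT; t++) {
                    int x0, x1, y0, y1;
                    range1d(tx, xb, t, &x0, &x1);
                    range1d(ty, yb, t, &y0, &y1);
                    rect(buf, t, x0, x1, y0, y1);
                }
    }
}
```

The single shared time axis is what makes this work: at step t of any phase, both the x range and the y range are computed from the same t, so every tile in a phase stays time-aligned.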

Fatal Flaw: Redundant Memory Accesses

Unfortunately, as great as trapezoid tiling initially seems to be, it also has a fatal flaw, which is redundant memory accesses at the overlapped regions of different tiles. To see how it happens, recall the 2D spacetime diagram of 1D trapezoid tiling:

trapezoid tiling

Although each tile has a trapezoidal shape in spacetime, in terms of memory access these tiles are overlapping 1D intervals. In other words, trapezoid tiling creates parallelism at the expense of redundant memory accesses.

overlapped memory access in trapezoid tiling

In fact, in trapezoid tiling, calculating as many timesteps as possible creates the highest overhead due to redundant traffic: in this case, the "valleys" in the middle mainly exist to fix up the mess left by the existing work, so they themselves have no chance to perform productive original work (sounds like an office-space analogy...).

To reduce the overlapped region and minimize the waste, there are three workarounds:

  1. Reduce the number of timesteps per tile - but then each pass does less useful work, which also creates inefficiency.

    For example, to minimize the overlapped area to just 2 elements:

    overlapped memory access in revised trapezoid tiling

    It would then only be possible to calculate 2 timesteps:

    spacetime diagram of the revised trapezoid tiling
  2. Increase the width of the tiles, but this is limited by the cache size.

  3. Use hierarchical tiling for both the L2 and L3 caches: each supertile fits inside the L3 cache, and each mountain or valley fits in the L2 cache. In principle, this recursive process can be repeated from small tiles to big tiles ad infinitum to utilize every level of the memory hierarchy, including L1, L2, L3, and even SSD page files; in the limit, the Sierpinski triangle is a theoretically optimal solution. This kind of fractal algorithm is known as a cache-oblivious algorithm, as it's optimal without knowing the cache parameters. It's also closely related to the Locally-Recursive non-Locally Asynchronous (LRnLA) algorithms proposed by a Russian research group from the Keldysh Institute of Applied Mathematics.

    Sierpinski triangle

    But this greatly complicates the logic of the PDE solver, which is already difficult enough to understand. Moreover, for GPUs with tiny on-chip local memory, its practicality is limited.
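For reference, the classic cache-oblivious realization of this recursive idea for the 1D 3-point stencil is Frigo and Strumpen's trapezoidal decomposition. The sketch below reconstructs it from memory, so treat the details (cut conditions, parameters N and T) as an assumption rather than the canonical code:

```c
/* Cache-oblivious recursive trapezoid decomposition for the 1D
   3-point stencil (after Frigo & Strumpen, reconstructed from memory).
   buf[t & 1] holds the state at timestep t. */
enum { N = 50, T = 20 };
static float buf[2][N];

static float f(float prev, float center, float next) {
    return 0.25f * prev + 0.5f * center + 0.25f * next;
}

static void kernel(int t, int x) {
    float prev = x > 0     ? buf[t & 1][x - 1] : 0.0f;
    float next = x < N - 1 ? buf[t & 1][x + 1] : 0.0f;
    buf[(t + 1) & 1][x] = f(prev, buf[t & 1][x], next);
}

/* trapezoid over timesteps [t0, t1): left edge starts at x0 and moves
   at slope dx0 per step, right edge at x1 with slope dx1 */
static void walk(int t0, int t1, int x0, int dx0, int x1, int dx1) {
    int dt = t1 - t0;
    if (dt == 1) {
        for (int x = x0; x < x1; x++)
            kernel(t0, x);
    } else if (dt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
            /* wide trapezoid: space cut along a line of slope -1 */
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
            walk(t0, t1, x0, dx0, xm, -1);
            walk(t0, t1, xm, -1, x1, dx1);
        } else {
            /* tall trapezoid: time cut into bottom and top halves */
            int s = dt / 2;
            walk(t0, t0 + s, x0, dx0, x1, dx1);
            walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}
```

A single call walk(0, T, 0, 0, N, 0) advances the whole domain T steps; the recursion keeps subdividing into smaller trapezoids, so every cache level is exploited without ever knowing its size.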

In 1D space, the overlapped memory accesses are only a small overhead, but the overhead grows rapidly in higher dimensions - nearly quadratically in 2D space and cubically in 3D space. For example, if the simulation domain is split into 10 non-overlapped rectangles and around 20 overlapped rectangles in the worst case (common when each tile is small due to cache size limitations), there's an 800% memory traffic penalty, which is sufficient to eliminate any benefit from trapezoid tiling. The penalty of 2D trapezoid tiling is smaller but still significant, especially on GPUs with tiny local memory.

Diamond Tiling and Hexagon Tiling

To overcome the problem of redundant memory accesses in trapezoid tiling, researchers later proposed a small modification of trapezoid tiling, known as diamond tiling. The insight of diamond tiling is that, when a valley finishes its calculation, it already contains all the data dependencies it needs to immediately start constructing another mountain on top of it:

diamond tiling startup

After this step, diamond tiling finishes initialization: all mountains and all valleys swap their roles. But note that the former valleys, having grown mountains on top of themselves, have now become diamonds.

diamond tiling after the first role swap between mountains and valleys

As a result, we now have a simulation domain tiled exclusively by diamonds. There are two types of diamonds, one type running several timesteps ahead of the other. I'll call one type "slow diamonds" and the other "fast diamonds".

diamond tiling finishes initialization and now contains two types of diamonds

To terminate diamond tiling, the last diamond tiles are truncated in the middle at the final timestep, so the diamonds degenerate back into ordinary mountains and valleys (omitted in the diagram). This is essentially the startup process in reverse.

Also note that in the special case where the diamond begins and ends with a width of more than one element, it's also known as hexagon tiling (hexagonal tiling).

As a result, diamond tiling solves the redundant memory access overhead in a simple and elegant manner. The problem is solved, right?
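To make the 1D process concrete, here is an illustrative sketch of my own (not taken from the literature; N, BT, T_END and the kernel are made up, and timestep T lives in buf[T & 1]). Spacetime levels [0, T_END) are tessellated by diamonds of half-height BT, processed phase by phase; clipping at t = 0 and t = T_END automatically produces the degenerate mountains (startup) and valleys (shutdown) described above:

```c
/* 1D diamond tiling: phase p processes the independent diamonds whose
   bases sit at time (p - 1) * BT, with centers offset by BT on odd
   phases. Requires N % (2 * BT) == 0 and T_END % BT == 0. */
enum { N = 48, BT = 4, T_END = 16 };

static float f(float prev, float center, float next) {
    return 0.25f * prev + 0.5f * center + 0.25f * next;
}

/* advance cells [lo, hi) (clipped) from time t to t + 1 */
static void row(float buf[2][N], int t, int lo, int hi) {
    int par = t & 1;
    if (lo < 0) lo = 0;
    if (hi > N) hi = N;
    for (int x = lo; x < hi; x++) {
        float prev = x > 0     ? buf[par][x - 1] : 0.0f;
        float next = x < N - 1 ? buf[par][x + 1] : 0.0f;
        buf[par ^ 1][x] = f(prev, buf[par][x], next);
    }
}

/* run the whole simulation from t = 0 to T_END */
static void diamond_1d(float buf[2][N]) {
    for (int p = 0; p <= T_END / BT; p++) {
        int t0  = (p - 1) * BT;
        int off = (p & 1) * BT;
        /* all diamonds of one phase are independent (parallelizable) */
        for (int c = off; c <= N; c += 2 * BT) {
            for (int r = 0; r < 2 * BT; r++) {
                int t = t0 + r;
                int m = r < BT ? r : 2 * BT - 1 - r;  /* half-width */
                if (t >= 0 && t < T_END)
                    row(buf, t, c - m, c + m + 1);
            }
        }
    }
}
```

Phase 0 is clipped into mountains, the final phase into valleys, and everything in between is pure diamonds - with no overlapping reads between tiles of the same phase, which is exactly the redundancy-free property claimed above.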

Fatal Flaw: Problematic Generalization to Higher Dimensions

Unfortunately, the literature says that diamond tiling cannot be generalized to higher-dimensional space in a clean manner. Personally, I don't even know how it can be done at all. The difficulty is that in diamond tiling, the tiles are misaligned on the time axis, so the Cartesian product trick no longer works.

During startup, suppose we attempt to process the tiling with the following schedule, using the Cartesian product trick:

  1. Combine all X mountains with Y mountains, process all in parallel

    (mountain, mountain)
  2. Combine all X mountains with Y diamonds, process all in parallel.

    (mountain, diamond)
  3. Combine all X diamonds with Y mountains, process all in parallel.

    (diamond, mountain)
  4. Combine all X diamonds with Y diamonds, process all in parallel.

    (diamond, diamond)

As one can see, 3 of these 4 steps have mismatched timesteps, which means those diamonds degenerate back to mountains. Only the last step (diamond, diamond) remains a diamond. It suggests that one cannot take full performance advantage of diamond tiling in multiple dimensions - okay, so far this still looks reasonable.

After diamond tiling finishes initialization and is fully started, the dependencies become more puzzling:

  1. Combine all X slow diamonds with Y slow diamonds, process all in parallel

    (slow diamond, slow diamond)
  2. Combine all X slow diamonds with Y fast diamonds, process all in parallel.

    (slow diamond, fast diamond)
  3. Combine all X fast diamonds with Y slow diamonds, process all in parallel.

    (fast diamond, slow diamond)
  4. Combine all X fast diamonds with Y fast diamonds, process all in parallel.

    (fast diamond, fast diamond)

At this point, the tile dependencies become extremely problematic and puzzling. Only two combinations are time-aligned: (slow diamond, slow diamond) and (fast diamond, fast diamond). The remaining two have misaligned time axes at the bottom of the tile, and without the bottom half of a diamond, it's impossible to calculate its top half.

As a result, in practice, diamond tiling is often only applied to one dimension, and the other dimensions are parallelized using parallelogram tiling or other more conventional methods. However, this greatly reduces the program's natural parallelism - a 2D 10x10 trapezoid tiling has 100 independent blocks, while a 1D tiling only has 10 - making it unsuitable for massively parallel computation.

Mysterious Russian Magic

The only hint I can find in the literature on how it may be done comes from a Russian research group at the Keldysh Institute of Applied Mathematics. This group has developed a series of different parallel spacetime decomposition algorithms over the last 20 years using a methodology called Locally-Recursive non-Locally Asynchronous (LRnLA) algorithms. LRnLA is not a single algorithm but a general guideline for designing them. Basically: (1) The tessellation should be a recursive process to utilize the entire memory hierarchy (it's not necessarily automatic - manual parameter tuning for different machines is allowed, as long as the tiling algorithm itself generalizes). (2) Parallelism should exist between different tiles, and the dependency conflict problem should be solvable in some natural way. Using both requirements as a starting point, the researchers manually examine the stencil dependencies and use their intuition in solid geometry to design custom algorithms that satisfy these goals. Unlike polyhedral compilers, these are custom solutions designed by human domain experts for human use, with geometric interpretations that ease implementation (but only from the perspective of mathematicians and physicists...).

2D and 3D diamond tiling were known to them as the ConeTur algorithm, which was used in their first generation of HPC code in the late 1990s. In the paper The DiamondCandy Algorithm for Maximum Performance Vectorized Cross-Stencil Computation, Perepelkina A.Y. and Levchenko V.D. describe it as:

ConeTur visualization

The ConeTur algorithm had some key aspects. It is easily generalized to higher dimensions by a superposition of 1D1T shapes. With the recursive subdivision for 1D simulation, the resulting shapes in 1D1T time-space were triangles (upright and upside-down) and diamonds, and there were special cases at the boundaries. In 2D1T the pyramids are tiled along with octahedrons and tetrahedrons, and the number of special shapes for the boundaries increases even more. For 3D1T the programming effort becomes unreasonable, so code generators were used.

Unfortunately there's no step-by-step procedure on how it may be implemented. I don't understand how these pyramids, octahedrons and tetrahedrons can be combined together. Furthermore, this algorithm is already obsolete from their perspective, so they likely have no interest in explaining it better in the future, as they have since developed even faster algorithms such as ConeFold, DiamondTorre, and DiamondCandy. Since these operate directly within 4D spacetime, they're even more difficult to understand. As far as I know, these algorithms are original and are not used or explained by anyone else. For a physicist who can already picture a 4D Minkowski spacetime and even do math inside it, they may be obvious. But for everyone else, even the simplest case of ConeTur is difficult. This group is at least 10 years ahead of the rest of the world in this field of research.

Question

Is it possible to generalize diamond tiling from 1D space to higher dimensions? A 2D space solution would already be satisfactory; 3D space would be even better (but I doubt anyone can answer that). If it's possible, what is the detailed procedure to do so? In particular, please explain the loop scheduling step by step, hopefully with mechanical rules that are easy to follow, and with projections to lower dimensions, similar to how I wrote my original question...

This is ultimately just a 3D spacetime tessellation problem, which should be solvable without specific programming or physics knowledge. But how, beyond the hints given in the paper by Perepelkina A.Y. and Levchenko V.D.?

Strictly speaking, it has already been solved, but the literature is unreadable for outsiders. Stencil computation is a classic field of HPC research, and tiling techniques have been studied since the 1970s. As a result, the papers are full of definitions, axioms, lemmas, and proofs about the difficult mathematical properties of these optimizations, or about a universal compiler framework that would work on any tile shape (e.g. see polyhedral compilers) - often without telling you what the tiles even look like graphically, because everyone working in this field already knows. There's a serious lack of introductory material on temporal tiling suitable for human optimization, rather than code generation.

Update: I asked the question elsewhere. A polyhedral compiler researcher saw it and told me that even though I'm looking for intuitive geometric and algebraic interpretations suitable for hand coding, an automatic stencil or polyhedral compiler may still be helpful. It can be insightful to feed the 2D Jacobi case into a ready-made polyhedral compiler (with many options), dump the generated dependency chains or loop schedules, and then visualize the output as a 3D model. If there's no alternative solution, I'm going to try this approach.


Solution

  • After looking at this ConeTur visualization again today, I finally found the key that leads to the solution: only 2D spacetime can be tessellated exclusively with diamonds. In 3D spacetime, at every pass, exactly one diamond can be created, not four; the remaining three all degenerate to trapezoids. Thus, the best one can do is create one diamond per iteration - but this still gives a slight speedup over pure trapezoid tiling.

    ConeTur visualization

    First, recall the 2D spacetime of 1D diamond tiling at startup:

    diamond tiling startup

    Here's the algorithm for startup. It's based on the same Cartesian product trick; I'll show how the timestep misalignment is handled later:

    1. Combine all X mountains with Y mountains, process all in parallel.

      (mountain, mountain)
    2. Combine all X mountains with Y diamonds, process all in parallel. In the diagrams, I use the color grey to mark the skipped regions. As one can see, due to missing dependencies, the Y diamonds degenerate into lower-half diamonds (valleys).

      (mountain, lower-half diamond)
    3. Combine all X diamonds with Y mountains, process all in parallel. Due to missing dependencies, the X diamonds degenerate into lower-half diamonds (valleys).

      (lower-half diamond, mountain)
    4. Combine all X diamonds with Y diamonds, process all in parallel. This is the only combination that allows the creation and computation of a full diamond.

      (diamond, diamond)

    As one can see, 3 of these 4 steps have mismatched timesteps, which means those diamonds degenerate back to valleys. Only the last step (diamond, diamond) remains a diamond.

    When the diamond tiling finishes initialization, as usual, the original mountains take the valleys' roles, so they too become diamonds. After this process, the 2D spacetime diagram becomes:

    2D diamond tiling after startup

    Here's the algorithm for continuing the calculation:

    1. Combine all X fast diamonds with Y fast diamonds, process all in parallel. Note that the lowest point of each diamond is also the top of the previous mountain and should be skipped (this is true for pure 1D as well).

      (fast diamond, fast diamond)
    2. Combine all X fast diamonds with Y slow diamonds, process all in parallel. For the fast diamonds, skip the top half due to missing dependencies, so they degenerate to lower-half diamonds (valleys). For the slow diamonds, skip and ignore the bottom half.

      (valley, lower-half diamond)
    • The bottom half of the slow diamond is skipped. Why? By symmetry: these slow diamonds are not new ones, but are exactly the same ones that we already partially processed in the last iteration, as the (mountain, lower-half diamond) case, shown below.
    (mountain, lower-half diamond)
    3. Combine all X slow diamonds with Y fast diamonds, process all in parallel. For the Y fast diamonds, skip the top half due to missing dependencies, so they degenerate to lower-half diamonds (valleys). For the X slow diamonds, skip and ignore the bottom half.

      (top-half slow diamond, lower-half fast diamond)
    • By symmetry, the skipped lower halves of the X slow diamonds have already been partially processed in the last iteration, as the (lower-half diamond, mountain) case, shown below.

      (lower-half diamond, mountain)
    4. Combine all X slow diamonds with Y slow diamonds - do not process anything. We already processed these slow diamonds in the previous iteration!

      (slow diamond, slow diamond)

    Further calculation is possible by applying the algorithm to the symmetrical case. As we can see, in each iteration, one (and only one) new group of diamonds is created. In the second and all future iterations (excluding the last, truncated one before stopping), the combinations from the original Cartesian product are reduced from 4 to 3.

    After explaining the theory of operation, the ConeTur visualization is now perfectly clear.

    ConeTur visualization

    All 4 combinations except the last one degenerate back into trapezoids. Thus, the main difficulty that seemed to prevent the generalization of diamond tiling to 2D - those "diamonds misaligned in time" - is avoided and shown to be non-existent: they are never created in the first place, due to trapezoid degeneration. As a result, at every pass, exactly one diamond can be created, not four; the remaining three can only be trapezoids.

    Still, creating one diamond per iteration gives a slight speedup over pure trapezoid tiling. In higher dimensions, although it becomes even more complicated, and assuming there's no control overhead (there will be), the savings can still be significant. This kind of behavior reminds me of Karatsuba multiplication, which also replaces 4 multiplications with 3.

    This must also be why the ConeTur algorithm was soon abandoned by the Russian research group at the Keldysh Institute of Applied Mathematics - the straightforward generalization of diamond tiling to 2D and higher dimensions only has limited gain - which motivated the later improvements. For example, ConeFold appears to be their second-generation algorithm, and seems to be a modification of the ConeTur algorithm that recombines the shapes. Meanwhile, DiamondTorre and DiamondCandy are completely new and use 2D as their starting point, and are thus more efficient.

    ConeTur visualization

    Now the real problem is understanding ConeFold, which is not within the scope of this question.

    2D diamond tiling visualization

    Update: As usual in life, you can't find something when you need it the most, and it pops up later when you don't need it anymore... Just after solving the problem myself, while looking for something else, I found a detailed step-by-step description of the solution in a research paper, even with 3D diagrams.

    • Walker, Anthony S., and Kyle E. Niemeyer. "Applying the swept rule for solving two-dimensional partial differential equations on heterogeneous architectures." Mathematical and Computational Applications 26.3 (2021): 52. https://www.mdpi.com/2297-8747/26/3/52

    But at least I know that my solution was correct, and now there's a reference for future readers... Sigh...

    Insight behind the DiamondTorre algorithm

    I'm finally able to work out a preliminary understanding of the key idea behind the mysterious DiamondTorre spacetime decomposition algorithm today.

    Unlike conventional diamond tiling, the DiamondTorre algorithm is a native 2D algorithm, not a direct generalization of the 1D algorithm. Thus, we apply diamond tiling to the XY plane (not the XT or YT plane) - that is, to the simulation space itself; see the figure below (only two layers of tiles are shown). The current timestep of every cell is also marked in the figure.

    DiamondTorre visualization

    After applying diamond tiling, the processing "cursor" moves the wavefront from right to left. In each stage, we select a column of diamond tiles along the Y axis - the different diamond tiles in a column can be calculated in parallel, similar to the rules of 1D diamond tiling. For the first wavefront, we would select the green, red, pink and purple tiles.

    We calculate each selected tile one timestep ahead. After finishing this step, now comes the key of the algorithm: shift each tile one unit to the right along the X axis (truncation at the boundary is also allowed). One then finds that calculating a new timestep is possible within the shifted tiles - they perfectly avoid the grid cells that we cannot calculate due to missing dependencies, leaving only good cells with complete dependencies.

    DiamondTorre's first wavefront

    Similarly, the missed grid cells that cannot be calculated in the current wavefront will be "fixed" after a new wavefront is calculated and shifted (we're using double buffering with two timesteps, so it's legal to access a cell at the same timestep or a cell exactly one timestep behind us). For example, in the second wavefront, we've selected the blue, purple and yellow tiles.

    DiamondTorre's second wavefront

    More and more tile shifts become possible as the wavefront moves further to the left.

    DiamondTorre wavefront visualization

    In 2D space, partitioning the space into diamond tiles only determines their initial positions; during execution they must keep moving to the right. Meanwhile, in 3D XYT spacetime, the process of sweeping a diamond tile over time creates a leaning tower (torre). The horizontal motion of the tile is just the projection of 3D spacetime onto 2D space when the towers are viewed from the top.

    DiamondTorre in 3D spacetime

    Thus, the DiamondTorre algorithm can be seen as a creative combination of two algorithms: 1D diamond tiling, and 1D parallelogram tiling with wavefront parallelism - this is the key insight for understanding it. This is the best explanation I currently have; refer to the DiamondTorre references at the end of the post for more information.
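    This "diamond times parallelogram" view can be checked with a toy Python function (the name and parameterization are mine, for illustration only): slicing one tower at its central y row and plotting the cells in the XT plane traces exactly the slanted parallelogram of 1D wavefront tiling.

    ```python
    def torre_xt_slice(cx0, radius, height):
        """(x, t) cells of one leaning tower at its central y row: an
        x-interval of width 2*radius - 1 that slides one cell in +x per
        timestep, tracing a parallelogram in the XT plane."""
        return [(x, t)
                for t in range(height)
                for x in range(cx0 + t - radius + 1, cx0 + t + radius)]
    ```

    Fixing t = 0 on the full tower instead recovers the 2D diamond footprint, which is where the 1D-diamond-tiling half of the combination lives.
    
    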

    We can also see that DiamondTorre is inherently a 2D algorithm: full parallelism only exists along the Y axis (wavefront parallelism along the X axis is still possible, at the expense of pipelined startup). Thus, parallel scalability and efficiency are maximized only when one dimension is significantly longer than the others - which is exactly the HPC cluster test case shown in the papers.

    To overcome this problem, the team later proposed an even more powerful algorithm named DiamondCandy, which is a native 3D algorithm and even more difficult to understand. If you're reading this and you're great at solid geometry, a tutorial-style explanation similar to my answer would be greatly appreciated...

    References

    For proper attribution, and for everyone else who finds this post via a Web search, here are all the papers I referenced while writing this question and answer (the question itself has already hit the character limit).

    Basis of this article

    • Fukaya, Takeshi, and Takeshi Iwashita. "Time-space tiling with tile-level parallelism for the 3D FDTD method." Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. 2018. https://www.capsl.udel.edu/pub/doc/memos/memo091.pdf

    Introduction to Stencils

    Theory and Applications

    Thesis

    Case Studies

    Mysterious Russian Magic

    These results came from the Russian Keldysh Institute of Applied Mathematics. Based on their own design methodology, called Locally Recursive non-Locally Asynchronous algorithms (LRnLA), and their intuition for solid geometry, they've designed and programmed multiple spacetime domain decomposition algorithms by hand, all based on tessellations of 3D and 4D spacetime. The latest two generations of algorithms are the 2D1T DiamondTorre and the 3D1T DiamondCandy. As far as I know, these algorithms are original and differ greatly from everyone else's. Not only that, they seem to be able to solve any application using the same logic, be it FDTD for electromagnetism, LBM for fluid mechanics, the discontinuous Galerkin method, the Runge-Kutta method... You can find them by searching for the keyword LRnLA.

    I can't fully understand even one of their articles. Here are all the descriptions I could find of these algorithms. If anyone understands them, please write an easy-to-follow tutorial in the style of my article - it would be greatly appreciated.

    DiamondCandy (latest)

    • Perepelkina A.Y., Levchenko V.D. The DiamondCandy Algorithm for Maximum Performance Vectorized Cross-Stencil Computation Keldysh Institute Preprints. 2018. No. 225. 23 p. doi:10.20948/prepr-2018-225-e, http://library.keldysh.ru/preprint.asp?id=2018-225&lg=e

    • Perepelkina, Anastasia, Vadim Levchenko, and Sergey Khilkov. "The DiamondCandy LRnLA algorithm: raising efficiency of the 3D cross-stencil schemes." The Journal of Supercomputing 75 (2019): 7778-7789. https://sci-hub.se/10.1007/s11227-018-2461-z

    • Perepelkina, A., and V. D. Levchenko. "Recursive DiamondCandy: non-memory-bound LRnLA algorithm for 3D cross stencil calculations on CUDA GPU." Proceedings of the 2020 4th International Conference on High Performance Compilation, Computing and Communications. 2020. https://sci-hub.se/10.1145/3407947.3407951

    DiamondTorre

    Others