Search code examples
algorithmmatrixsparse-matrix

Storing and reading n-dimensional sparse matrices


The Wikipedia page on the Yale format for storing sparse matrices covers 2D matrices, but what about higher dimensions? Are there any algorithms like the Yale format (or extensions of?) that can store n>2-dimensional sparse matrices?—And by that, I mean an algorithm that stores in some sort of compressed fashion, because of course you could just store the raw matrices.

Most searches for this topic seem to turn up specific language implementations, which are useless to me because I'm searching for an adaptable algorithm.


Solution

  • We can imagine two extremes:

    • At one extreme, you assume that a large proportion of elements are nonzero, so you just store the value of every single element (not optimizing for sparsity), don't need to store any extra information about the positions of elements, and get a total size of mn.
    • At the other extreme, you assume that a small proportion of elements are nonzero, so you store only the nonzero elements, but now you also need to store their positions (row and column) and get a total size of 3‌x (where x is the number of nonzero values).

    Yale format is a compromise between these two, where we assume that the number of nonzero elements is larger than the number of rows,1 but small compared to the total number of elements; so we store a pointer2 for each row, and then store each element's value and position within its row, giving a total size of m + 2‌x.

    With more than two dimensions, you can continue to make the same compromise; you just need to choose the right dimension, or right combination of dimensions, so that the number of nonzero elements is less than the number of rows/hyper-rows/etc.

    For example, for an m×n×p array indexed by (ijk), you have two compromise options:3

    • If the number of nonzero elements is expected to more than a small fraction of mn, but much smaller than mnp, then for each i and j, you can store a pointer to the one-dimensional "slice" of elements that have that i and j. For each nonzero element in that slice, you store its value and its k, for a total size of mn + 2‌x.
    • If the number of nonzero elements is expected to be larger than a small fraction of m, but much smaller than mn, then for each i, you can store a pointer to the two-dimensional "slice" of elements that have that i. For each nonzero element in that slice, you store its value, its j, and its k, for a total size of m + 3‌x.

    More broadly, a d-dimensional array will have d−1 compromise options . . . or a total of 2d options, if you count each possible choice of dimensions separately.


    Notes:

    1. I write "rows" here, but you can also do the same thing column-wise instead.
    2. Not necessarily a "pointer" in the very strict sense; you can just store a total count of all nonzero elements in all previous rows.
    3. Not counting the fact that you can choose to use different dimensions; much as you can use Yale format on either rows or columns, you can use these options on any choice of dimensions.