Search code examples
csparse-matrix

Optimal space-efficient storage scheme for Sparse Binary Matrix with runs


In the context of implementing a "search window" for fast dynamic time warping in C, I need to build a structure that represents a sparse binary matrix with a very particular structure.

This matrix has some strong structural constraints on it: specifically, there is only one "run" of non-zero data per row, although that run is of variable length. The value of the starting cell's column index and ending cell's column index increases monotonically as the row increases:

Example valid matrices (* = filled, . = unfilled)

  0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *


  0 1 2 3 4 5
0 * * * * . .
1 . * * * * .
2 . . * * * .
3 . . . . * *
4 . . . . . *

Example invalid matrices:

  0 1 2 3 4 5
0 * * * . * * <-- more than one continuous run
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *

  0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 * * * * * * <-- start index not monotonically increasing
4 . . * * * *


  0 1 2 3 4 5
0 * * * . . .
1 . * * * * .
2 . * * * . . <-- end index not monotonically increasing
3 . . * * * *
4 . . * * * *

These sparse matrices are mostly populated on the diagonals, but they are not square, so I don't know if I should be using "jagged diagonal storage".

My current approach is to store (start, end) column indices for each row: i.e.

  0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *

is (logically) stored as

(0, 0) --> (0, 2)
(1, 1) --> (1, 3)
(2, 1) --> (2, 3)
(3, 2) --> (3, 5) 
(4, 2) --> (4, 5)

(although these are internally stored as ravelled indices, i.e.)

(0 * num_cols + 0) --> (0 * num_cols + 2)
(1 * num_cols + 1) --> (1 * num_cols + 3)

so the final structure ends up looking like

[0, 2, 7, 9, 13, 15, 20, 23, 26, 29]

Then, this structure is delta encoded like

[first_value, offset_1, offset_2, ...]
[0, 2, 5, 2, 4, 2, 5, 3, 3, 3] 

Because these delta values are small and positive, we can take advantage of packing them into some flavor of VARINT structure.

First question: do these matrices have a well-known name? Can't seem to find much in the literature.

Second question: is this storage scheme taking advantage of all the strong properties of the matrices? Can I abuse the constraints in order to store less data somehow?

I have read a few documents describing sparse matrix storage for general sparse matrices, but it strikes me that this special case structure might have a special case optimal storage / encoding scheme.


Solution

  • I think your approach of considering the absolute index values (and then using delta encoding) already yields good results, but it does not make use of the monotonically incrementing start/end-index constraint.

    Considering a storage structure that stores - relative to an absolute start and end index - only the increment of the start and end index should - in average - yield numbers with a smaller range (and consequently a shorter representation).

    Translating your first sample into this structure would look as following:

    (0/2) - (1/1)(0/0)(1/2)(0/0)
    

    where the first pair (0/2) represents the absolute start/end - indices, and the other pairs stand for the increment of start and end for each line. As these numbers use a smaller range, a variable length integer encoding should yield a better compression.

    Consider, for example, the following (simple) integer encoding:

    0 .. 00
    1 .. 01
    2 .. 100
    3 .. 101
    4 .. 1100
    5 .. 1101
    6 and higher: 111 + 16 bit integer
    

    With this encoding, the first sample translates into the following bit stream comprising 22 bits:

    00 100 01 01 00 00 01 100 00 00
    

    This approach works the better the smaller the increments are; When considering the increments row-wise, this should be the case if the matrix has more rows than columns.

    Just an idea for optimizing matrixes with more columns than rows: one could think of using the same storage approach column-wise; I think (yet without mathematical proof) that row-wise monotonically increments with single runs imply also column-wise monotonically increments with single runs.