In the context of implementing a "search window" for fast dynamic time warping in C, I need to build a structure that represents a sparse binary matrix with a very particular structure.
This matrix has some strong structural constraints on it: specifically, there is only one "run" of non-zero data per row, although that run is of variable length. The value of the starting cell's column index and ending cell's column index increases monotonically as the row increases:
Example valid matrices (* = filled, . = unfilled)
0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *
0 1 2 3 4 5
0 * * * * . .
1 . * * * * .
2 . . * * * .
3 . . . . * *
4 . . . . . *
Example invalid matrices:
0 1 2 3 4 5
0 * * * . * * <-- more than one continuous run
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *
0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 * * * * * * <-- start index not monotonically increasing
4 . . * * * *
0 1 2 3 4 5
0 * * * . . .
1 . * * * * .
2 . * * * . . <-- end index not monotonically increasing
3 . . * * * *
4 . . * * * *
These sparse matrices are mostly populated on the diagonals, but they are not square, so I don't know if I should be using "jagged diagonal storage".
My current approach is to store (start, end) column indices for each row: i.e.
0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *
is (logically) stored as
(0, 0) --> (0, 2)
(1, 1) --> (1, 3)
(2, 1) --> (2, 3)
(3, 2) --> (3, 5)
(4, 2) --> (4, 5)
(although these are internally stored as ravelled indices, i.e.)
(0 * num_cols + 0) --> (0 * num_cols + 2)
(1 * num_cols + 1) --> (1 * num_cols + 3)
so the final structure ends up looking like
[0, 2, 7, 9, 13, 15, 20, 23, 26, 29]
Then, this structure is delta encoded like
[first_value, offset_1, offset_2, ...]
[0, 2, 5, 2, 4, 2, 5, 3, 3, 3]
Because these delta values are small and positive, we can take advantage of packing them into some flavor of VARINT structure.
First question: do these matrices have a well-known name? Can't seem to find much in the literature.
Second question: is this storage scheme taking advantage of all the strong properties of the matrices? Can I abuse the constraints in order to store less data somehow?
I have read a few documents describing sparse matrix storage for general sparse matrices, but it strikes me that this special case structure might have a special case optimal storage / encoding scheme.
I think your approach of considering the absolute index values (and then using delta encoding) already yields good results, but it does not make use of the monotonically incrementing start/end-index constraint.
Considering a storage structure that stores - relative to an absolute start and end index - only the increment of the start and end index should - in average - yield numbers with a smaller range (and consequently a shorter representation).
Translating your first sample into this structure would look as following:
(0/2) - (1/1)(0/0)(1/2)(0/0)
where the first pair (0/2)
represents the absolute start/end - indices, and the other pairs stand for the increment of start and end for each line.
As these numbers use a smaller range, a variable length integer encoding should yield a better compression.
Consider, for example, the following (simple) integer encoding:
0 .. 00
1 .. 01
2 .. 100
3 .. 101
4 .. 1100
5 .. 1101
6 and higher: 111 + 16 bit integer
With this encoding, the first sample translates into the following bit stream comprising 22 bits:
00 100 01 01 00 00 01 100 00 00
This approach works the better the smaller the increments are; When considering the increments row-wise, this should be the case if the matrix has more rows than columns.
Just an idea for optimizing matrixes with more columns than rows: one could think of using the same storage approach column-wise; I think (yet without mathematical proof) that row-wise monotonically increments with single runs imply also column-wise monotonically increments with single runs.