Optimal data structure for a table

Our team is working on implementation of the table widget for mobile platform (one of the application is mobile office like MS Excel).

We need to optimize the data structure for storing table data (the simple 2-d array is used).

Could you, please, suggest the optimal data structure for storing table data. Below are some of requirements for the data structure:

the size of the table can be up to 2^32 x 2^32;
majority of table cells are empty (i.e. the table is sparse), so is is desirable not to store data for empty cells;
interface of the data structure should support inserting/removing rows and columns;
data structure should allow to iterate through non-empty cells in forward and backward direction;
cells of the table can be merged (i.e. one cell can span more than one row and/or column).

Solution

After thinking more about the problem with the row/column insertion/deletion, I've come up with something that looks promising.

First, create and maintain 2 sorted data structures (e.g. search trees) containing all horizontal and all vertical indices that have at least one non-empty cell.

For this table:

 ABCDE
1
2*
3 %  #
4
5   $

You'd have:

A,B,D,E - used horizontal indices
2,3,5 - used vertical indices

Store those A,B,D,E,2,3,5 index values inside some kind of a node in the 2 aforementioned structures such that you can link something to it knowing the node's address in memory (again, a tree node fits perfectly).

In each cell (non-empty) have a pair of links to the index nodes describing its location (I'm using & to denote a link/reference to a node):

*: &2,&A
%: &3,&B
#: &3,&E
$: &5,&D

This is sufficient to define a table.

Now, how do we handle row/column insertion? We insert the new row/column index into the respective (horizontal or vertical) index data structure and update the index values after it (=to the right or below). Then we add new cells for this new row/column (if any) and link them to the appropriate index nodes.

For example, let's insert a row between rows 3 and 4 and add a cell with @ in it at 4C (in the new row):

 ABCDE
1
2*
3 %  #
4  @   <- new row 4
5      <- used to be row 4
6   $  <- used to be row 5

Your index structures are now:

A,B,C(new),D,E - used horizontal indices
2,3,4(new),6(used to be 5) - used vertical indices

The cells now link to the index nodes like this:

*: &2,&A - same as before
%: &3,&B - same as before
#: &3,&E - same as before
@: &4,&C - new cell linking to new index nodes 4 and C
$: &6,&D - used to be &5,&D

But look at the $ cell. It still points to the same two physical nodes as before, it's just that the vertical/row node now contains index 6 instead of index 5.

If there were 100 cells nodes below the $ cell, say occupying only 5 non-empty rows, you'd need to update only 5 indices in the row/vertical index data structure, not 100.

You can delete rows and columns in a similar fashion.

Now, to make this all useful, you also need to be able to locate every cell by its coordinates.

For that you can create another sorted data structure (again, possibly a search tree), where every key is a combination of the addresses of the index nodes and the value is the location of cell data (or the cell data itself).

With that, if you want to get to cell 3B, you find the nodes for 3 and B in the index data structures, take their addresses &3 and &B, combine them into &3*2³²+&B and use that as a key to locate the % cell in the 3rd data structure I've just defined. (Note: 2³² is actually 2^{pointer size in bits} and can vary from system to system.)

Whatever happens to other cells, the addresses &3 and &B in the %'s cell links will remain the same, even if the indices of the % cell change from 3B to something else.

You may develop iteration on top of this easily.

Merging should be feasible too, but I haven't focused on it.