I have a geometric diagram that consists of 5,000 cells, each of which is an arbitrary polygon. My application will need to save many such diagrams.
I have determined that I need to use a database to make indexed queries against this map. Loading all the map data is far too inefficient for quick responses to simple queries.
I've added the cell data to the database. It has a fairly simple structure:
CREATE TABLE map_cell (
    map_id INT NOT NULL,
    cell_index INT NOT NULL,
    ...
    PRIMARY KEY (map_id, cell_index)
)
5,000 rows per map is quite a few, but queries should remain efficient into the millions of rows because the main join indexes can be clustered. If it gets too unwieldy, it can be partitioned on map_id bounds. Despite the large number of rows per map, this table would be quite scalable.
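If partitioning ever becomes necessary, a range partition on map_id might look something like this (a sketch, assuming MySQL syntax; the boundary values are placeholders):

ALTER TABLE map_cell
PARTITION BY RANGE (map_id) (
    PARTITION p0   VALUES LESS THAN (100),
    PARTITION p1   VALUES LESS THAN (200),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);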
The problem comes with storing the data that describes which cells neighbor each other. The cell-neighbor relationship is a many-to-many relationship against the same table, and there are a very large number of such relationships per map. A normalized table would probably look something like this:
CREATE TABLE map_cell_neighbors (
    id INT NOT NULL AUTO_INCREMENT,
    map_id INT NOT NULL,
    cell_index INT NOT NULL,
    neighbor_index INT NOT NULL,
    ...
    PRIMARY KEY (id),
    INDEX IX_neighbors (map_id, cell_index)
)
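With this layout, finding the neighbors of one cell is a single indexed lookup, and joining back to map_cell to pull the neighbors' properties rides on the clustered primary key (a sketch with placeholder map and cell values):

SELECT c.*
FROM map_cell_neighbors AS n
JOIN map_cell AS c
  ON c.map_id = n.map_id
 AND c.cell_index = n.neighbor_index
WHERE n.map_id = 1
  AND n.cell_index = 0;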
This table requires a surrogate key that will never be used in a join, ever. Also, this table includes duplicate entries: if cell 0 is a neighbor of cell 1, then cell 1 is always a neighbor of cell 0. I can eliminate these duplicates, at the cost of some extra index space:
CREATE TABLE map_cell_neighbors (
    id INT NOT NULL AUTO_INCREMENT,
    map_id INT NOT NULL,
    neighbor1 INT NOT NULL,
    neighbor2 INT NOT NULL,
    ...
    PRIMARY KEY (id),
    INDEX IX_neighbor1 (map_id, neighbor1),
    INDEX IX_neighbor2 (map_id, neighbor2)
)
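The tradeoff is that finding all neighbors of a cell now has to check both columns, something like this (again with placeholder values):

SELECT neighbor2 AS neighbor_index
FROM map_cell_neighbors
WHERE map_id = 1 AND neighbor1 = 0
UNION ALL
SELECT neighbor1
FROM map_cell_neighbors
WHERE map_id = 1 AND neighbor2 = 0;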
I'm not sure which option would be considered more "normalized": option 1 includes duplicate entries (including duplicates of any properties the relationship has), and option 2 is some pretty weird database design that just doesn't feel normalized. Neither option is very space-efficient. For 10 maps, option 1 used 300,000 rows taking up 12 MB of file space; option 2 used 150,000 rows taking up 8 MB. In both tables the indexes take up more space than the data: each row should hold about 20 bytes of data, but actually occupies 40-50 bytes on disk.
The third option wouldn't be normalized at all, but would be incredibly space- and row-efficient. It involves putting a VARBINARY field in map_cell, and storing a binary-packed list of neighbors in the cell table itself. This would take 24-36 bytes per cell, rather than 40-50 bytes per relationship. It would also reduce the overall number of rows, and the queries against the cell table would be very fast due to the clustered primary key. However, performing a join against this data would be impossible. Any recursive queries would have to be done one step at a time. Also, this is just really ugly database design.
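For reference, option 3 would look roughly like this (a sketch; the column name and size are placeholders, and packing/unpacking the neighbor list would happen entirely in application code):

CREATE TABLE map_cell (
    map_id INT NOT NULL,
    cell_index INT NOT NULL,
    neighbors VARBINARY(64),   -- binary-packed list of neighbor cell_index values
    ...
    PRIMARY KEY (map_id, cell_index)
)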
Unfortunately, I need my application to scale well and not hit SQL bottlenecks with just 50 maps. Unless I can think of something else, the latter option might be the only one that really works. Before I committed such a vile idea to code, I wanted to make sure I was looking clearly at all the options. There may be another design pattern I'm not thinking of, or maybe the problems I'm foreseeing aren't as bad as they appear. Either way, I wanted to get other people's input before pushing too far into this.
The most complex queries against this data will be path-finding and discovery of paths. These will be recursive queries that start at a specific cell, travel through neighbors over several iterations, and collect/compare properties of those cells. I'm pretty sure I can't do all of this in SQL; there will likely be some application code throughout. I'd like to be able to perform queries like this at moderate size and get results quickly enough to feel "responsive" to the user, about a second. The overall goal is to keep large table sizes from causing repeated queries or fixed-depth recursive queries to take several seconds or more.
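To illustrate the kind of traversal I mean, a fixed-depth neighbor expansion against the option 1 table might be sketched as a recursive CTE (assuming a database that supports WITH RECURSIVE; the starting cell, map, and depth are placeholders, and there is no cycle check, so cells can be revisited):

WITH RECURSIVE reachable (cell_index, depth) AS (
    SELECT 0, 0                          -- start at cell 0
    UNION ALL
    SELECT n.neighbor_index, r.depth + 1
    FROM reachable AS r
    JOIN map_cell_neighbors AS n
      ON n.map_id = 1
     AND n.cell_index = r.cell_index
    WHERE r.depth < 3                    -- fixed expansion depth
)
SELECT DISTINCT cell_index FROM reachable;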
Not sure which database you are using, but you seem to be re-inventing what a spatially-enabled database already supports.
If SQL Server, for example, is an option, you could store your polygons as geometry types, use the built-in spatial indexing, and call the OGC-compliant methods such as STContains, STCrosses, STOverlaps, and STTouches.
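For example, something along these lines (a sketch, assuming the geometry type; the spatial index bounding box and the sample values are placeholders):

CREATE TABLE map_cell (
    map_id INT NOT NULL,
    cell_index INT NOT NULL,
    shape geometry NOT NULL,
    PRIMARY KEY (map_id, cell_index)
);

CREATE SPATIAL INDEX SIdx_map_cell_shape
    ON map_cell (shape)
    WITH (BOUNDING_BOX = (0, 0, 1000, 1000));   -- placeholder extents

-- neighbors of cell 0 in map 1: the polygons whose boundaries touch it
SELECT n.cell_index
FROM map_cell AS c
JOIN map_cell AS n
  ON n.map_id = c.map_id
 AND n.cell_index <> c.cell_index
 AND n.shape.STTouches(c.shape) = 1
WHERE c.map_id = 1
  AND c.cell_index = 0;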
SQL Server spatial indexes, which decompose the polygons into multiple grid levels of a B-tree, also use tessellation to index which neighboring cells a given polygon touches at a given level of the index.
There are other mainstream databases that support spatial types as well, including MySQL.