How to store a sparse boolean vector in a database?


Let's say I have a book with ~2^40 pages. Each day, I read a random chunk of contiguous pages (sometimes including pages I've already read). What's the smartest way to store and update the record of which pages I've read in a (SQLite) database?

My current idea is to store [firstChunkPage, lastChunkPage] entries in a table, but I'm not sure how to update it efficiently.

  • Should I first check for every possible overlap and then update?
  • Should I just insert my new range and then merge overlapping entries (perhaps several times, since multiple overlaps can occur)? I'm not sure how to build such a SQL query.

This looks like a pretty common problem, so I'm wondering if anyone knows a 'recognized' solution for this.

Any help or ideas are welcome!

EDIT: The reading isn't actually random. The number of chunks is expected to stay roughly constant and very small compared to the number of pages.


Solution

  • Your idea of storing ranges as (firstChunkPage, lastChunkPage) pairs should work well if the data is relatively sparse.

    Unfortunately, a lookup query such as:

    SELECT count(*) FROM ranges
    WHERE firstChunkPage <= page AND page <= lastChunkPage
    

    cannot run efficiently on ordinary indexes; it needs a spatial index.

    For SQLite, you should use the R-Tree module, which implements this kind of index. Quoting its documentation:

    An R-Tree is a special index that is designed for doing range queries. R-Trees are most commonly used in geospatial systems where each entry is a rectangle with minimum and maximum X and Y coordinates. ... For example, suppose a database records the starting and ending times for a large number of events. A R-Tree is able to quickly find all events, for example, that were active at any time during a given time interval, or all events that started during a particular time interval, or all events that both started and ended within a given time interval.

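    With an R-Tree, you can very quickly identify all ranges that overlap a new one before inserting it, and replace them with a single combined entry. Note that the R-Tree stores coordinates as 32-bit floating-point values by default, so page numbers near 2^40 are rounded outward (lower bounds down, upper bounds up); keep the exact bounds in a regular table if you need them.

    To create your R-Tree index, use something like this: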

    CREATE VIRTUAL TABLE demo_index USING rtree(
        id, firstChunkPage, lastChunkPage
    );
    

    For more information, read the SQLite R-Tree documentation.
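
The "find all overlaps, then replace them with one combined entry" step can be sketched as follows. This is only an illustration, not a definitive implementation: it uses Python's built-in sqlite3 module and an ordinary table named chunks (a hypothetical name, not from the question) so that it runs on any SQLite build and keeps 64-bit page numbers exact. The helper names create_schema, add_range and is_read are likewise assumptions. Ranges that merely touch (for example 10..20 and 21..30) are merged as well, which is a design choice rather than a requirement.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical plain table standing in for the R-Tree index;
    # INTEGER columns hold 64-bit values, so pages near 2**40 stay exact.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        " firstChunkPage INTEGER PRIMARY KEY,"
        " lastChunkPage INTEGER NOT NULL)"
    )


def add_range(conn, first, last):
    """Record pages first..last (inclusive), merging with stored ranges."""
    # A stored range overlaps or directly touches [first, last] when it
    # starts no later than last + 1 and ends no earlier than first - 1.
    overlap = "firstChunkPage <= ? AND lastChunkPage >= ?"
    args = (last + 1, first - 1)
    with conn:  # commit on success, roll back on error
        lo, hi = conn.execute(
            "SELECT min(firstChunkPage), max(lastChunkPage) "
            f"FROM chunks WHERE {overlap}",
            args,
        ).fetchone()
        if lo is not None:
            # Remove every range being absorbed, then widen the new one.
            conn.execute(f"DELETE FROM chunks WHERE {overlap}", args)
            first, last = min(first, lo), max(last, hi)
        conn.execute(
            "INSERT INTO chunks (firstChunkPage, lastChunkPage) VALUES (?, ?)",
            (first, last),
        )


def is_read(conn, page):
    """Return True if some stored range contains the given page."""
    row = conn.execute(
        "SELECT count(*) FROM chunks "
        "WHERE firstChunkPage <= ? AND ? <= lastChunkPage",
        (page, page),
    ).fetchone()
    return row[0] > 0
```

For example, after add_range(conn, 10, 20), add_range(conn, 30, 40) and add_range(conn, 18, 32), the table holds the single row (10, 40). Because every insert merges first, stored ranges never overlap, so one SELECT, one DELETE and one INSERT per update are enough; no repeated merging passes are needed.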