Search code examples
sqliterelational-databaselarge-data

Organizing tables with data-heavy rows to optimize access times


I am working with a sqlite3 database of around 70 gigabytes right now. This db has three tables: one with about 30 million rows, and two more with ~150 and ~300 million each, with each table running from 6-11 columns.

The table with the fewest rows is consuming the bulk of the space, as it contains a raw data column of zipped BLOBs, generally running between 1 and 6 kilobytes per row; all other columns in the database are numeric, and the zipped data is immutable so inefficiency in modification is not a concern.

I have noticed that creating indexes on the numeric columns of this table:

[15:52:36] Query finished in 723.253 second(s).

takes several times as long as creating a comparable index on the table with five times as many rows:

[15:56:24] Query finished in 182.009 second(s).
[16:06:40] Query finished in 201.977 second(s).

Would it be better practice to store the BLOB data in a separate table to access with JOINs? The extra width of each row is the most likely candidate for the slow scan rate of this table.

My current suspicions are:

  1. This is mostly due to the way data is read from disk, making skipping medium-sized amounts of data impractical and yielding a very low ratio of usable data per sector read from the disk by the operating system, and
  2. It is therefore probably standard practice that I did not know as a relative newcomer to relational databases to avoid putting larger, variable-width data into the same table as other data that may need to be scanned without indices

but I would appreciate some feedback from someone with more knowledge in the field.


Solution

  • In the SQLite file format, all the column values in a row are simply appended together, and stored as the row value. If the row is too large to fit into one database page, the remaining data is stored in a linked list of overflow pages.

    When SQLite reads a row, it reads only as much as needed, but must start at the beginning of the row.

    Therefore, when you have a blob (or a large text value), you should move it to the end of the column list so that it is possible to read the other columns' values without having to go through the overflow page list:

    CREATE TABLE t (
        id INTEGER PRIMARY KEY,
        a INTEGER,
        [...],
        i REAL,
        data BLOB NOT NULL,
    );
    

    With a single table, the first bytes of the blob value are still stored inside the table's database pages, which decreases the number of rows that can be stored in one page.

    If the other columns are accessed often, then it might make sense to move the blob to a separate table (a separate file should not be necessary). This allows the database to go through more rows at once when reading a page, but increases the effort needed to look up the blob value.