Search code examples
pythonsqlsqlitecompression

How to deal with repetition of a column in Sqlite? By using compression?


I know (from the answers of this question) that Sqlite by default doesn't enable compression. Is it possible to enable it, or would this require another tool? Here is the situation:

I need to add millions of rows in a Sqlite database. The table contains a description column (~500 char on average), and on average, each description is shared by, say, 40 rows, like this:

id    name    othercolumn    description 
-------------------------------------------------
1     azefds  ...            This description will be the same for probably 40 rows
2     tsdyug  ...            This description will be the same for probably 40 rows
...
40    wxcqds  ...            This description will be the same for probably 40 rows
41    azeyui  ...            This one is unique
42    uiuotr  ...            This one will be shared by 60 rows
43    poipud  ...            This one will be shared by 60 rows
...
101   iuotyp  ...            This one will be shared by 60 rows
102   blaxwx  ...            Same description for the next 10 rows
103   sdhfjk  ...            Same description for the next 10 rows
...

Question:

  • Would you just insert rows like this, and enable a compression algorithm of the DB?

    Pros:

    • you don't have to deal with 2 tables

or

  • Would you use 2 tables?

      id    name    othercolumn    descriptionid
      -------------------------------------------------
      1     azefds  ...            1
      2     tsdyug  ...            1    
      ...
      40    wxcqds  ...            1
      41    azeyui  ...            2
      ...
    
      id    description
      -------------------------------------------------
      1     This description will be the same for probably 40 rows
      2     This one is unique
    

    Cons:

    • instead of the simple select id, name, description from mytable from solution #1, we have to use 2 tables, and multiple queries to retrieve data. Or maybe is it possible to do it without a complex query, but with a clever query with union or merge or anything like this?

    • also, if some description are no longer used, we have to periodically clean up the 2nd table to remove obsolete entries


Solution

  • Using multiple tables will not only prevent inconsistency, and take less space, but may also be faster, even if multiple/more complex queries are involved (precisely because it involves moving less data around). Which you should use depends on which of those characteristics are most important to you.

    A query to retrieve the results when you have 2 tables would look something like this (which is really just a join between the two tables):

    select table1.id, table1.name, table1.othercolumn, table2.description
    from table1, table2
    where table1.descriptionid=table2.id