Search code examples
sqlpostgresqlrelational-databasesparse-matrix

Representing Sparse Data in PostgreSQL


What's the best way to represent a sparse data matrix in PostgreSQL? The two obvious methods I see are:

  1. Store data in a single a table with a separate column for every conceivable feature (potentially millions), but with a default value of NULL for unused features. This is conceptually very simple, but I know that with most RDMS implementations, that this is typically very inefficient, since the NULL values ususually takes up some space. However, I read an article (can't find its link unfortunately) that claimed PG doesn't take up data for NULL values, making it better suited for storing sparse data.

  2. Create separate "row" and "column" tables, as well as an intermediate table to link them and store the value for the column at that row. I believe this is the more traditional RDMS solution, but there's more complexity and overhead associated with it.

I also found PostgreDynamic, which claims to better support sparse data, but I don't want to switch my entire database server to a PG fork just for this feature.

Are there any other solutions? Which one should I use?


Solution

  • A few solutions spring to mind,

    1) Separate your features into groups that are usually set together, create a table for each group with a one-to-one foreign key relationship to the main data, only join on tables you need when querying

    2) Use the EAV anti-pattern, create a 'feature' table with a foreign key field from your primary table as well as a fieldname and a value column, and store the features as rows in that table instead of as attributes in your primary table

    3) Similarly to how PostgreDynamic does it, create a table for each 'column' in your primary table (they use a separate namespace for those tables), and create functions to simplify (as well as efficiently index) accessing and updating the data in those tables

    4) create a column in your primary data using XML, or VARCHAR, and store some structured text format within it representing your data, create indexes over the data with functional indexes, write functions to update the data (or use the XML functions if you are using that format)

    5) use the contrib/hstore module to create a column of type hstore that can hold key-value pairs, and can be indexed and updated

    6) live with lots of empty fields