Tags: postgresql, data-structures, customization, indexing, b-tree

Hybrid "Index-like" btree structure - can PostgreSQL do this?


I am new to PostgreSQL. I have a very unusual requirement for a hybrid database I need to build. From the modules I've seen, it seems the following should be possible.

I need to be able to add key-[values] entries to an index without actually adding the data to a table. Simply put, I need a key-[values] store, ideally as a btree (for lookup speed). An index structure is ideal, but perhaps another structure will do.

To be very specific, I wish to store something like:

KEY     [IDs]
Blue    10, 20, 23, 47
Green   5, 12, 40

I don't want the overhead of storing this data AND indexing it. I just need the data "indexed but not stored" so to speak.

Equally important is the ability to query these structures and get the data (IDs) out, to perform INTERSECT etc. on the IDs, and IN, BETWEEN, =, etc. on the keys.

As you can probably guess, the end goal is a final list of IDs, which would then be sent to the client, and looked up at will.

EDIT

What I don't want is to record the key for every value. Using the example above, I don't want to store {Blue, 10}, {Blue, 20} etc. I want to store {Blue, [10, 20, 23, 47]}.

If I store this as a traditional table, I cannot see a way around this duplication problem.

Looking again at {Blue, [10, 20, 23, 47]}, this is technically nothing more than a single btree, where the IDs (10, 20, 23, 47) are marked as values, and the parent key "Blue" is marked as a key.

Since this data type mismatch could be messy in a single tree, I believe the ideal solution is "[btrees] in a btree", where "btree" is the key, and [btrees] is a btree for each group of values of a key.


Solution

  • If you really insist on doing it this way you can store the values as an array, and the intarray module provides operators to manipulate those. That is:

    -- "values" must be double-quoted: VALUES is a reserved word in PostgreSQL
    create table data(key text primary key, "values" int[] not null);
    insert into data
      values('Blue', '{10,20,23,47}'),('Green','{5,12,40}'),('Red', '{5,10,28}');
    

    with this you can write:

    select unnest("values") from data where key = 'Blue'
      intersect
      select unnest("values") from data where key = 'Red';
    

    Ideally you would have an aggregate function to convert an int[] to a set and calculate intersections etc. directly, but no such aggregates are built in. The intarray extension does, however, provide a pairwise & (intersection) operator on arrays.
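
    As a sketch, assuming the intarray extension is available and the column is the quoted "values" from the table definition above, the & operator intersects two arrays directly, avoiding the unnest/intersect round trip:

    create extension if not exists intarray;

    -- intersect the two arrays directly; & returns the common elements
    select a."values" & b."values" as common_ids
      from data a, data b
      where a.key = 'Blue' and b.key = 'Red';
    -- common_ids is {10}
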

    Really, this is just a slightly more compact storage of the more typical structure:

    select key, unnest("values") as value from data;
      key  | value
    -------+-------
     Blue  |    10
     Blue  |    20
     Blue  |    23
    [...]
    

    In fact, you can simply define a view to be the above query.
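
    For instance (a sketch; the view name data_flat is arbitrary):

    -- expose the array storage as a conventional (key, value) relation
    create view data_flat as
      select key, unnest("values") as value from data;
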

    A more normalised approach would be to have two tables: one to describe keys, one to associate them with values:

    create table key_dimension(key_id serial primary key, key text not null unique);
    insert into key_dimension(key) values('Blue'),('Green'),('Red');
    create table key_value(key_id int not null references key_dimension(key_id), value int not null, primary key(key_id, value));
    insert into key_value(key_id, value)
      select key_id, unnest("values") from key_dimension join data using (key);
    

    and now:

    select value from key_value
      where key_id = (select key_id from key_dimension where key = 'Red')
    intersect
    select value from key_value
      where key_id = (select key_id from key_dimension where key = 'Blue')
    

    So any queries that select key values need only run against the set of keys (key_dimension), and then a minimal synthetic key (key_id) is used to convert these into actual sets of data values (from key_value).
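
    As a sketch of the IN/BETWEEN-style key lookups from the question, under this two-table layout (using the sample key names above):

    -- all IDs belonging to any of a set of keys, deduplicated
    select distinct value
      from key_value
      where key_id in (select key_id from key_dimension
                       where key in ('Blue', 'Green'))
      order by value;

    Note that the composite primary key on (key_id, value) is itself the btree that serves these lookups, which is about as close to "indexed but not stored twice" as a plain table gets.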