
Proper choice for a database consisting of a dictionary


I have a large dictionary that has the following format:

dict["randomKey"] = [dict1, dict2, int, string]

There will probably be some tens of thousands of keys. dict1 itself has ~ 100 keys.

Question is: I need this dictionary to be stored on a server and read by multiple machines. What is the best format for this?

I am now using shelve, which is very easy to use. However, I need to get all the keys from the main dictionary (dict) for a certain value of a key from dict1 or dict2. This is already taking some time, and I'm afraid that when the dictionary grows much larger, say 50k keys, it will take forever. I've read about sqlite3 and it seems a good alternative, but I don't know whether it's a good fit for my needs.

I don't really need the database to be accessible to programs other than Python ones (it would be nice, though), but I need it to be fast, stable, and able to have many computers read from it at the same time. Thanks!
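For reference, the lookup described above is inherently slow with shelve because there are no secondary indexes: every record has to be opened and checked. A minimal sketch of that full scan (the file name, keys, and values here are made up for illustration):

```python
import shelve

# Build a tiny shelf shaped like the structure in the question:
# db[key] = [dict1, dict2, int, string]
with shelve.open("data.db") as db:
    db["a"] = [{"k6": "v134"}, {}, 1, "one"]
    db["b"] = [{"k6": "v200"}, {}, 2, "two"]

# Find every main key whose dict1 has a given value for a given
# sub-key. shelve has no indexes, so this touches every record --
# O(n) in the number of main keys.
def keys_with(db, subkey, value):
    return [k for k, (d1, d2, i, s) in db.items() if d1.get(subkey) == value]

with shelve.open("data.db") as db:
    print(keys_with(db, "k6", "v134"))  # ['a']
```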


Solution

  • I'd choose a database with native JSON support, which can search inside JSON dictionaries efficiently. I like PostgreSQL:

    A table for your data:

    create table dict (
      key text primary key,
      dict1 jsonb not null default '{}',
      dict2 jsonb not null default '{}',
      intval integer not null,
      strval text not null
    );
    

    Fill it up with some sample values:

    insert into dict
    select
      i::text,
      (select
        jsonb_object(
          array_agg('k'||v::text),
          array_agg('v'||(v+i)::text)
        ) from generate_series(1,1000) as v
      ),
      (select
        jsonb_object(
          array_agg('k'||v::text),
          array_agg('v'||(v+i)::text)
        ) from generate_series(1,1000) as v
      ),
      i,
      i::text
    from generate_series(1,10000) as i;
    

    Get the key for which there's value v134 for key k6 in dict1:

    select key from dict where dict1 @> '{"k6":"v134"}';
     key 
    -----
     128
    (1 row)
    
    Time: 232.843 ms
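The result 128 follows from how the sample data was generated: in row i, dict1 maps 'k'||v to 'v'||(v+i), so k6 holds v134 exactly when 6 + i = 134, i.e. i = 128. The same construction replayed in Python, as a sanity check:

```python
# Rebuild the dict1 that the SQL above generates for row i:
# keys k1..kn, where kv maps to 'v' || (v + i).
def dict1_for(i, n=1000):
    return {f"k{v}": f"v{v + i}" for v in range(1, n + 1)}

# In row i, k6 holds 'v' || (6 + i), so the containment test
# dict1 @> '{"k6":"v134"}' matches exactly when 6 + i == 134.
# (n=6 is enough here, since only k6 matters.)
matches = [i for i in range(1, 10001) if dict1_for(i, n=6)["k6"] == "v134"]
print(matches)  # [128]
```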
    

You can even index the dictionary columns for faster search if your table is really large. But these indexes can be larger than the table itself, and the planner may decide it's cheaper not to use them anyway:

    create index dict_dict1_idx on dict using gin(dict1);
    create index dict_dict2_idx on dict using gin(dict2);
    

You can force the database to use the indexes if you know that it's beneficial:

    set enable_seqscan=off;
    select key from dict where dict1 @> '{"k6":"v134"}';
     key 
    -----
     128
    (1 row)
    
    Time: 8.955 ms
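If you do end up with sqlite3 instead, note that it can also query inside JSON stored as text, via its built-in JSON functions, though it has no jsonb containment operator or GIN indexes like PostgreSQL. A minimal Python sketch with a hypothetical table mirroring the one above (smaller sample sizes, purely for illustration):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    create table dict (
      key text primary key,
      dict1 text not null,   -- JSON stored as text
      dict2 text not null,
      intval integer not null,
      strval text not null
    )
""")

# A few rows shaped like the PostgreSQL sample data:
# in row i, dict1 maps kv to 'v' || (v + i).
rows = [
    (str(i), json.dumps({f"k{v}": f"v{v + i}" for v in range(1, 101)}),
     json.dumps({}), i, str(i))
    for i in range(1, 201)
]
con.executemany("insert into dict values (?, ?, ?, ?, ?)", rows)

# Same lookup as the jsonb containment query, via json_extract:
cur = con.execute(
    "select key from dict where json_extract(dict1, '$.k6') = ?", ("v134",)
)
print(cur.fetchall())  # [('128',)]
```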