python, persistence, lookup, lookup-tables, object-persistence

Large list of dictionaries as a lookup table on disk


I have a very large list of dictionaries (GBs in size) obtained from an API. I would like to use it as a lookup table for other functions. There are several object persistence methods in Python, but what would you recommend for storing lists of dictionaries on disk for easy referencing and lookup?

[
  {
    "library_id": "7",
    "set_id": "80344779",
    "description": "Very long description 1 ...",
    "value": "1"
  },
  {
    "library_id": "22",
    "set_id": "80344779",
    "description": "Very long description 2 ...",
    "value": "1"
  },
  {
    "library_id": "24",
    "set_id": "80344779",
    "description": "Very long description 3 ...",
    "value": "8"
  },
  ...
]

Solution

  • Your data seems to be regular, i.e. there is no variation in the dicts' keys, right? One could simply use a document-based solution like MongoDB, but I think a simple SQL-based database would be more efficient and easier to implement.
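
    For example, here is a minimal sketch using the standard-library sqlite3 module; the column names follow the sample above, while the file name lookup.db and the choice of set_id as the indexed lookup key are just assumptions:

        import sqlite3

        # a small stand-in for the large list of dicts obtained from the API
        records = [
            {"library_id": "7", "set_id": "80344779", "description": "...", "value": "1"},
            {"library_id": "22", "set_id": "80344779", "description": "...", "value": "1"},
        ]

        conn = sqlite3.connect("lookup.db")  # a plain file on disk, not in memory
        conn.execute("""
            CREATE TABLE IF NOT EXISTS records (
                library_id TEXT,
                set_id TEXT,
                description TEXT,
                value TEXT
            )
        """)
        conn.executemany(
            "INSERT INTO records VALUES (:library_id, :set_id, :description, :value)",
            records,
        )
        conn.execute("CREATE INDEX IF NOT EXISTS idx_set_id ON records (set_id)")
        conn.commit()

        # later: look up rows without loading the whole table into memory
        rows = conn.execute(
            "SELECT library_id, value FROM records WHERE set_id = ?", ("80344779",)
        ).fetchall()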

    Alternatives would be the pickle module (not recommended for really large objects, as they have to be loaded into memory all at once) or shelve, which builds on top of pickle but handles large files more efficiently, since entries are not loaded into memory all at once. The benefit of shelve is its syntax, which mimics Python's dict syntax and is easy to use (see the shelve documentation). And there is no need to set up a MongoDB or MySQL database (which can get complicated, at least on Windows). Both pickle and shelve are part of the standard library.
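
    A rough sketch of the shelve approach, assuming each record is stored under its library_id (any unique string key would do):

        import shelve

        records = [
            {"library_id": "22", "set_id": "80344779", "description": "...", "value": "1"},
        ]

        # build the shelf once; each entry is pickled to disk individually
        with shelve.open("lookup_shelf") as db:
            for record in records:
                db[record["library_id"]] = record

        # later (even in another script): dict-like access, entries are loaded lazily
        with shelve.open("lookup_shelf", flag="r") as db:
            entry = db["22"]  # {'library_id': '22', 'set_id': '80344779', ...}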

    You might also check the dataset package and its easy-to-use interface. It uses a SQLite database under the hood.
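
    A short sketch with dataset (pip install dataset), writing to a SQLite file; again, the file name and the indexed column are assumptions:

        import dataset

        records = [
            {"library_id": "24", "set_id": "80344779", "description": "...", "value": "8"},
        ]

        db = dataset.connect("sqlite:///lookup.db")  # plain SQLite file on disk
        table = db["records"]
        table.insert_many(records)    # the table schema is created automatically
        table.create_index(["set_id"])

        row = table.find_one(library_id="24")           # a single dict-like row
        matching = list(table.find(set_id="80344779"))  # all rows for one set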

    If you're dealing with huge files (say, more than 2 GB), I would not stick with dataset or shelve, but use more mature solutions such as SQLAlchemy (with a MySQL database) or MongoDB and its Python interface (PyMongo).
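
    For completeness, a minimal PyMongo sketch; it assumes a MongoDB server is already running locally, and the database and collection names are made up:

        from pymongo import MongoClient

        records = [
            {"library_id": "7", "set_id": "80344779", "description": "...", "value": "1"},
        ]

        client = MongoClient("mongodb://localhost:27017")
        coll = client["lookup_db"]["records"]
        coll.insert_many(records)
        coll.create_index("set_id")  # index for fast lookups on set_id

        doc = coll.find_one({"library_id": "7"})
        matching = list(coll.find({"set_id": "80344779"}))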