
Cache results of a time-intensive operation


I have a program (PatchDock), which takes its input from a parameters file, and produces an output file. Running this program is time-intensive, and I'd like to cache results of past runs so that I need not run the same parameters twice.

I'm able to parse the input and output files into appropriate data structures. The input file, for example, is parsed into a dictionary-like object. The input keys are all strings, and the values are primitive data types (ints, strings, and floats).

My approach

My first idea was to use the md5 hash of the input file as the key in a shelve database. However, this fails to match runs whose parsed inputs are identical but whose input files differ slightly (comments, spacing, order of parameters, et cetera).

Hashing the parsed parameters seems like the best approach to me. But the only way I can think of getting a unique hash from a dictionary is to hash a sorted string representation.
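For concreteness, here is a minimal sketch of that idea. The choice of `json.dumps` with `sort_keys=True` as the canonical form and md5 as the hash are my assumptions; any deterministic serialization and stable hash would do:

```python
import hashlib
import json

def params_key(params):
    """Hash a canonical (sorted-key) JSON serialization of the parameters.

    Assumes all values are primitives (ints, floats, strings), so JSON
    serializes them unambiguously.
    """
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Because the keys are sorted before serialization, two dictionaries with the same items in different insertion orders produce the same key.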

Question

Hashing a string representation of a parameters dictionary seems like a roundabout way of achieving my end goal: keying unique input values to their corresponding output. Is there a more straightforward way to achieve this caching system?

Ideally, I'm looking to achieve this in Python.


Solution

  • Hashing a sorted representation of the parsed input is actually the most straightforward way of doing this, and the one that makes sense. Your instincts were correct.

    Basically, you're normalizing the input (by parsing it and sorting it), and then using that to construct a hash key.
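To illustrate, here is a sketch of how the normalize-sort-hash key could back a shelve cache. The `run_patchdock` callable and the db path are placeholders, not part of the question; md5 and JSON serialization are assumptions:

```python
import hashlib
import json
import shelve

def cache_key(params):
    # Normalize: sort the keys so equivalent inputs hash identically,
    # regardless of comments, spacing, or ordering in the original file.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def run_cached(params, run_patchdock, db_path="patchdock_cache"):
    """Return the cached output for `params`, calling `run_patchdock`
    (a stand-in for the real run) only on a cache miss."""
    key = cache_key(params)
    with shelve.open(db_path) as db:
        if key not in db:
            db[key] = run_patchdock(params)
        return db[key]
```

With this in place, running the same parameters twice hits the shelve database the second time instead of re-running the program, and cosmetic differences between input files never matter because only the parsed dictionary is hashed.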