
Selective Re-Memoization of DataFrames


Say I set up memoization with Joblib as follows (using the solution provided here):

from tempfile import mkdtemp
cachedir = mkdtemp()

from joblib import Memory
# joblib 0.12+ names this argument `location`; older releases used `cachedir`.
memory = Memory(location=cachedir, verbose=0)

@memory.cache
def run_my_query(my_query):
    ...
    return df

And say I define a couple of queries, query_1 and query_2, both of which take a long time to run.

I understand that, with the code as it is:

  • The second call with either query would use the memoized output, i.e.:

    run_my_query(query_1)
    run_my_query(query_1) # <- Uses cached output
    
    run_my_query(query_2)
    run_my_query(query_2) # <- Uses cached output   
    
  • I could use memory.clear() to delete the entire cache directory

But what if I want to redo the memoization for only one of the queries (e.g. query_2) without deleting the cached result of the other query?


Solution

  • Memory.clear() always erases the whole cache directory, and the library does not seem to offer a partial erase at the Memory level.

    You can split the cache and the function into two separate pairs:

    from tempfile import mkdtemp
    from joblib import Memory
    
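    # One Memory object (and cache directory) per query.
    # joblib 0.12+ names this argument `location`; older releases used `cachedir`.
    memory1 = Memory(location=mkdtemp(), verbose=0)
    memory2 = Memory(location=mkdtemp(), verbose=0)
    
    @memory1.cache
    def run_my_query1():
        # run query_1
        return df
    
    @memory2.cache
    def run_my_query2():
        # run query_2
        return df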
    

    Now, you can selectively clear the cache:

    memory2.clear()
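
As a possible alternative to keeping two Memory objects, the sketch below assumes a joblib release (0.12 or later, where the constructor argument is named `location`) in which the object returned by `@memory.cache` has its own `clear()` method that empties only that function's cache. The `calls` list and the string return values are stand-ins added for illustration; they are not part of the original code.

```python
from tempfile import mkdtemp
from joblib import Memory

# A single cache directory shared by both functions.
memory = Memory(location=mkdtemp(), verbose=0)

calls = []  # records every real (non-cached) execution, for illustration only

@memory.cache
def run_my_query1():
    calls.append("query_1")
    return "result_1"  # stand-in for the DataFrame returned by query_1

@memory.cache
def run_my_query2():
    calls.append("query_2")
    return "result_2"  # stand-in for the DataFrame returned by query_2

# First round: both functions really run and their results are cached.
run_my_query1()
run_my_query2()

# Drop only the cache that belongs to run_my_query2.
run_my_query2.clear(warn=False)

# Second round: query_1 is served from the cache, query_2 runs again.
run_my_query1()
run_my_query2()

print(calls)
```

Note that this clears every cached result of the function, whatever the arguments were, so it only helps when each query lives in its own function.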
    

    UPDATE after seeing behzad.nouri's comment:

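    You can use the call method of the decorated function to force the original function to run again. Note, as the following example shows, that its return value differs from that of a normal call: it is a tuple of the result and a metadata dictionary, so you have to unpack it yourself.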

    >>> import tempfile
    >>> import joblib
    >>> memory = joblib.Memory(location=tempfile.mkdtemp(), verbose=0)
    >>> @memory.cache
    ... def run(x):
    ...     print('called with {}'.format(x))  # for debug
    ...     return x
    ...
    >>> run(1)
    called with 1
    1
    >>> run(2)
    called with 2
    2
    >>> run(3)
    called with 3
    3
    >>> run(2)  # Cached
    2
    >>> run.call(2)  # Force call of the original function
    called with 2
    (2, {'duration': 0.0011069774627685547, 'input_args': {'x': '2'}})
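
One way to avoid handling the return value of `call` is sketched below. It assumes that `call` persists the new result to the cache, as joblib documents, and it uses a helper named `refresh` that is hypothetical and not part of joblib. The helper discards whatever `call` returns and then reads the value back through a normal cached call. The `executions` list is only there to show which calls really ran.

```python
import tempfile
import joblib

# Assumes joblib 0.12+, where the cache directory argument is `location`.
memory = joblib.Memory(location=tempfile.mkdtemp(), verbose=0)

executions = []  # every real (non-cached) execution is recorded here

@memory.cache
def run(x):
    executions.append(x)
    return x * 10

def refresh(cached_func, *args, **kwargs):
    """Hypothetical helper: recompute one cached call and return the plain result."""
    # Re-run the original function and overwrite its cache entry;
    # the return value of call() is deliberately ignored.
    cached_func.call(*args, **kwargs)
    # This is now a cache hit and returns only the function output.
    return cached_func(*args, **kwargs)

run(1)                   # executed and cached
run(2)                   # executed and cached
value = refresh(run, 2)  # only the entry for x=2 is recomputed
run(1)                   # still served from the cache

print(value, executions)
```

The extra cached read costs one load from disk, which may matter for very large DataFrames; in that case unpacking the return value of `call` directly is the cheaper route.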