Tags: python, memcached, multiprocessing, mod-wsgi, shelve

Persistent multiprocess shared cache in Python with stdlib or minimal dependencies


I just tried using Python's shelve module as the persistent cache for data fetched from the external service. The complete example is here.

I was wondering what the best approach would be if I want to make this multiprocess-safe. I am aware of Redis, memcached and such "real solutions", but I'd like to use only parts of the Python standard library or very minimal dependencies, to keep my code compact and not introduce unnecessary complexity when running the code in a single-process, single-thread model.

It's easy to come up with a single-process solution, but that does not work well with current Python web run-times. Specifically, the problem is that in an Apache + mod_wsgi environment:

  • Only one process updates the cached data at a time (file locks, somehow? see the sketch after this list)

  • Other processes use the cached data while the update is under way

  • If a process fails to update the cached data, there is a penalty of N minutes before another process can try again (to prevent a thundering herd and such). How to signal this between mod_wsgi processes?

  • You do not utilize any "heavy tools" for this, only the Python standard library and UNIX
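
For the locking and failure-penalty points above, something along these lines is roughly what I have in mind (an untested sketch assuming a POSIX system; the file names, the RETRY_PENALTY value and the try_update helper are just placeholders):

import fcntl
import os
import time

# Placeholder names for illustration
CACHE_FILE = "/tmp/market_cache.db"       # the shelve file shared by the processes
LOCK_FILE = CACHE_FILE + ".lock"          # guards the update step
FAILURE_MARKER = CACHE_FILE + ".failed"   # its mtime records the last failed attempt
RETRY_PENALTY = 5 * 60                    # seconds before another process may retry


def try_update(update_callable):
    """Run update_callable in at most one process; return True if this process updated."""

    # Honour the penalty: if a recent attempt failed, do not try again yet
    if os.path.exists(FAILURE_MARKER):
        if time.time() - os.path.getmtime(FAILURE_MARKER) < RETRY_PENALTY:
            return False

    lock_fd = os.open(LOCK_FILE, os.O_CREAT | os.O_RDWR)
    try:
        try:
            # Non-blocking exclusive lock: fails immediately if another process
            # is already updating, so this process keeps serving the stale data
            fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False

        try:
            update_callable()  # fetch from the external service and write the shelve
            if os.path.exists(FAILURE_MARKER):
                os.remove(FAILURE_MARKER)
        except Exception:
            # Signal the failure to the other processes via the marker file's mtime
            with open(FAILURE_MARKER, "a"):
                pass
            os.utime(FAILURE_MARKER, None)
            raise
        finally:
            fcntl.flock(lock_fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(lock_fd)

The idea is that fcntl.flock with LOCK_NB lets exactly one process win the update while the others return immediately, and the mtime of the marker file is the cross-process signal for the retry penalty.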

Also, if some PyPI package does this without external dependencies, please let me know about it. Alternative approaches and recommendations, like "just use sqlite", are welcome.

Example:

import datetime
import os
import shelve
import logging


logger = logging.getLogger(__name__)


class Converter:

    def __init__(self, fpath, api_url, refresh_delay=datetime.timedelta(minutes=15)):
        self.api_url = api_url              # external service endpoint, used in update()
        self.refresh_delay = refresh_delay  # how long cached data stays valid; the default is arbitrary
        self.last_updated = None

        if os.path.exists(fpath):
            # Check the file's age before shelve.open() creates it,
            # so a brand new cache file does not look up to date
            self.last_updated = datetime.datetime.fromtimestamp(os.path.getmtime(fpath))

        self.data = shelve.open(fpath)

    def convert(self, source, target, amount, update=True, determiner="24h_avg"):
        # Do something with cached data
        pass

    def is_up_to_date(self):
        if not self.last_updated:
            return False

        return datetime.datetime.now() < self.last_updated + self.refresh_delay

    def update(self):
        try:
            # Update data from the external server
            self.last_updated = datetime.datetime.now()
            self.data.sync()
        except Exception as e:
            logger.error("Could not refresh market data: %s %s", self.api_url, e)
            logger.exception(e)

Solution

  • I'd say you'd want to use an existing caching library; dogpile.cache comes to mind. It already has many features, and you can easily plug in the backends you might need (a rough example follows below the quoted documentation).

    dogpile.cache documentation tells the following:

    This “get-or-create” pattern is the entire key to the “Dogpile” system, which coordinates a single value creation operation among many concurrent get operations for a particular key, eliminating the issue of an expired value being redundantly re-generated by many workers simultaneously.
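
    For instance, a region backed by the bundled DBM file backend, which coordinates access with lock files and therefore works across mod_wsgi processes, could look roughly like this (a sketch only; the file path, expiration time and the load_rates function are placeholders):

    from dogpile.cache import make_region

    # File-backed region shared by all processes; no cache server needed
    region = make_region().configure(
        "dogpile.cache.dbm",
        expiration_time=15 * 60,  # seconds before a cached value is considered stale
        arguments={"filename": "/tmp/market_cache.dbm"},
    )

    @region.cache_on_arguments()
    def load_rates(source, target):
        # Fetch from the external service; dogpile's lock ensures only one
        # process regenerates an expired key while the others keep the old value
        ...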