Tags: python, caching, pyqt5, background-process, prefetch

Prefetch items and return requested item immediately


I need to load a lot of large images from a network share (which is not very fast) for processing. The images are named in a sequence (e.g. 1.png, 2.png, 3.png, etc.).

In most cases, loading will happen in sequence (loading n+1.png after n.png). I would like to have n+1.png in memory before it is actually requested.

I would also like to keep a cache, so that going one image back does not require disk access.

I envision something like this:

  1. Request image with index n
  2. Check whether n.png is in the cache; if it is not: a. load the image from disk, b. put it in the cache
  3. Perform steps 1 and 2 for the image with index n+1
  4. Do not wait for step 3 to finish; take the image from the cache and return it

Nice-to-have feature: clean the cache in the background so that it only contains the 10 most recently requested items, or remove the oldest items until it contains at most 10 items (I imagine the latter is easier to implement while being good enough for my case).

I am using Python 3.5 with PyQt5, but I would prefer the function not to rely on PyQt5 functionality (unless that makes the implementation much cleaner or more readable).


Solution

  • The simple answer (assuming you're not using coroutines or the like, which you probably aren't given that you're using PyQt5) is to spawn a daemon background thread to load image n+1 into the cache. Like this:

    import threading
    from functools import partial

    def load(self, n):
        # Return the cached image, loading and caching it on a miss.
        # Note: the lock is held for the entire load.
        with self._cache_lock:
            try:
                return self._cache[n]
            except KeyError:
                self._cache[n] = self._load_image(n)
                return self._cache[n]

    def process_image(self, n):
        img = self.load(n)
        # Kick off a prefetch of the next image; a daemon thread will not
        # keep the program alive at exit.
        threading.Thread(target=partial(self.load, n+1), daemon=True).start()
        self.step1(img)
        self.step2(img)

    The problem with this design is that you're holding the lock around the entire load operation, including the slow _load_image call. If step1 and step2 take significantly longer than _load_image, it may be cheaper to avoid that lock by allowing rare duplicate work:

    def cacheget(self, n):
        # Only the dictionary lookup happens under the lock.
        with self._cache_lock:
            return self._cache.get(n)

    def preload(self, n):
        # Load outside the lock; two threads may occasionally do the same
        # load, but neither blocks the other while reading from disk.
        img = self._load_image(n)
        with self._cache_lock:
            self._cache[n] = img
        return img

    def process_image(self, n):
        img = self.cacheget(n)
        if img is None:
            img = self.preload(n)
        threading.Thread(target=partial(self.preload, n+1), daemon=True).start()
        self.step1(img)
        self.step2(img)
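
    Both snippets assume the surrounding object has already created the cache and its lock. A minimal, hypothetical setup (the class name is invented here; the question doesn't give one):

    import threading

    class ImageProcessor:
        def __init__(self):
            self._cache = {}          # index -> image
            self._cache_lock = threading.Lock()

        # The load/process_image (or cacheget/preload/process_image)
        # methods from the snippets above go here.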

    If you're expecting to do lots of processing in parallel, you may want to use a ThreadPoolExecutor to queue up all of your preloads, instead of a daemon thread for each one.
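
    A sketch of that variant, reusing the hypothetical ImageProcessor class above (cacheget, preload, step1, and step2 as already defined; max_workers=2 is an arbitrary guess, tune it to how many concurrent reads the share handles well):

    from concurrent.futures import ThreadPoolExecutor

    class ImageProcessor:
        def __init__(self):
            self._cache = {}
            self._cache_lock = threading.Lock()
            # One shared pool, created once; pending preloads queue up here
            # instead of each spawning a fresh daemon thread.
            self._executor = ThreadPoolExecutor(max_workers=2)

        def process_image(self, n):
            img = self.cacheget(n)
            if img is None:
                img = self.preload(n)
            # Queued in the pool instead of starting a new thread each time.
            self._executor.submit(self.preload, n + 1)
            self.step1(img)
            self.step2(img)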

    If you want to clean old cache values, see lru_cache and its implementation. There are a lot of tuning decisions to make (like: do you actually want background cache garbage collection, or can you just push the oldest item out whenever adding a new one would take the cache past 10 items, the way lru_cache does?), but none of the options are particularly hard to build once you decide what you want.
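
    For the second option, here is a minimal sketch of a size-capped, least-recently-used cache built on OrderedDict (the 10-item limit comes from the question; the class and method names are invented):

    import threading
    from collections import OrderedDict

    class LRUImageCache:
        """Evicts the least recently used image once the cache
        exceeds max_items entries."""
        def __init__(self, max_items=10):
            self._items = OrderedDict()
            self._lock = threading.Lock()
            self._max_items = max_items

        def get(self, n):
            with self._lock:
                img = self._items.get(n)
                if img is not None:
                    self._items.move_to_end(n)   # mark as most recently used
                return img

        def put(self, n, img):
            with self._lock:
                self._items[n] = img
                self._items.move_to_end(n)
                while len(self._items) > self._max_items:
                    self._items.popitem(last=False)   # evict the oldest entry

    Eviction here happens inline on each put, which matches the "remove the oldest items until it contains at most 10" option from the question and avoids needing a separate background cleanup thread.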