Search code examples
pythondjangodjango-querysetdatabase-performance

Django Prefetch Related with Filter on Max Value


We have a pair of models that look (roughly) like this:

class Machine(models.Model):
    machine_id = models.CharField(max_length=10)
    # Other irrelevant fields

    @property
    def latest_update(self):
        if self.machineupdate_set.count() == 0:
            return None
        return self.machineupdate_set.order_by('-update_time')[:1].get()

class MachineUpdate(models.Model):
    machine = models.ForeignKey(Machine)
    update_time = models.DateTimeField(auto_now_add=True)
    # Other irrelevant fields

Whenever we load Machines from the database, we always end up using the latest_update for that machine. When we first implemented this, we had a lot of machines and a fairly small number of updates per machine, so to improve performance (by reducing query count) we added a simple default prefetch to the model manager for Machine:

class MachineManager(models.Manager):

    def get_queryset(self):
        return super(MachineManager, self).get_queryset().prefetch_related('machineupdate_set')

However, things have changed and now we have huge numbers of updates related to each machine, and the prefetch query is starting to become a problem (both in terms of long query execution time and memory consumption).

We are looking for a smarter way of pre-fetching the required data, as all we really need to pre-fetch is the latest update per machine, not all of them. Having looked at the Django prefetch_related docs it looked like we could change get_queryset in our MachineManager to something like this:

def get_queryset(self):
    latest_update_query = MachineUpdate.objects.order_by('-update_time')[:1]
    latest_update_prefetch = models.Prefetch('machineupdate_set', queryset=latest_update_query, to_attr='_latest_update')
    return super(MachineManager, self).get_queryset().prefetch_related(latest_update_prefetch)

and then modify latest_update to use the new attribute populated by the pre-fetch. However, this does not work because whenever we filter a Machine query using this we get an error: AssertionError: Cannot filter a query once a slice has been taken.

Can anyone suggest a solution to this problem, so that we can efficiently load the latest_update for each machine? We are unsure how to fix the issue we have with the above attempt at pre-fetching the latest updates.

(FYI - we have considered adding a is_latest_update boolean field to MachineUpdate which we can filter on, or alternatively a latest_update foreign key reference on Machine, however we want to avoid having to maintain this redundant information).


Solution

  • I see that MachineUpdate.update_time has auto_now_add=True. So we can use Max(MachineUpdate.id) per Machine group to get the last MachineUpdate. Right? If that's True checkout the following code:

    class MachineManager(models.Manager):
        pass
    
    class MachineQueryset(models.QuerySet):
        def with_last_machineupdate(self):
            return self.prefetch_related(models.Prefetch('machineupdate_set',
                queryset=MachineUpdate.objects.filter(
                    id__in=Machine.objects \
                        .annotate(last_machineupdate_id=models.Max('machineupdate__id')) \
                        .values_list('last_machineupdate_id', flat=True) \
                ),
                #notice the list word
                to_attr='last_machineupdate_list'
            ))
    
    
    class Machine(models.Model):
        machine_id = models.CharField(max_length=10)
        objects = MachineManager.from_queryset(MachineQueryset)()
    
        @property
        def latest_update(self):
            if hasattr(self, 'last_machineupdate_list') and len(self.last_machineupdate_list) > 0:
                return self.last_machineupdate_list[0]
            return None
    
    class MachineUpdate(models.Model):
        machine = models.ForeignKey(Machine)
        update_time = models.DateTimeField(auto_now_add=True)
    
        def __unicode__(self):
            return str(self.update_time)
    

    Usage:

    machines = Machine.objects.filter(...).with_last_machineupdate()
    

    If that's not the case, e.g. we can't use Max('machineupdate__id') and we need to stick with update_time field. Then a slightly more optimized solution (but still getting all MachineUpdates per Machine) looks like this:

    class MachineManager(models.Manager):
        def get_queryset(self):
            return super(MachineManager, self).get_queryset() \
                .prefetch_related(models.Prefetch('machineupdate_set',
                    queryset=MachineUpdate.objects.order_by('-update_time')
                ))
    
    class Machine(models.Model):
        machine_id = models.CharField(max_length=10)
        objects = MachineManager()
    
        @property
        def latest_update(self):
            #this will not make queries
            machine_updates = self.machineupdate_set.all()
            if len(machine_updates) > 0:
                return machine_updates[0]
            return None