Search code examples
djangodjango-querysetcorrelated-subquery

How to get a Count based on a subquery?


I am suffering to get a query working despite all I have been trying based on my web search, and I think I need some help before becoming crazy.

I have four models:

class Series(models.Model):
    puzzles = models.ManyToManyField(Puzzle, through='SeriesElement', related_name='series')
    ...

class Puzzle(models.Model):
    puzzles = models.ManyToManyField(Puzzle, through='SeriesElement', related_name='series')
    ...

class SeriesElement(models.Model):
    puzzle = models.ForeignKey(Puzzle,on_delete=models.CASCADE,verbose_name='Puzzle',)
    series = models.ForeignKey(Series,on_delete=models.CASCADE,verbose_name='Series',)
    puzzle_index = models.PositiveIntegerField(verbose_name='Order',default=0,editable=True,)

class Play(models.Model):
    puzzle = models.ForeignKey(Puzzle, on_delete=models.CASCADE, related_name='plays')
    user = models.ForeignKey(settings.AUTH_USER_MODEL, blank=True,null=True, on_delete=models.SET_NULL, related_name='plays')
    series = models.ForeignKey(Series, blank=True, null=True, on_delete=models.SET_NULL, related_name='plays')
    puzzle_completed = models.BooleanField(default=None, blank=False, null=False)
    ...

each user can play any puzzle several times, each time creating a Play record. that means that for a given set of (user,series,puzzle) we can have several Play records, some with puzzle_completed = True, some with puzzle_completed = False

What I am trying (unsuccesfully) to achieve, is to calculate for each series, through an annotation, the number of puzzles nb_completed_by_user and nb_not_completed_by_user.

For nb_completed_by_user, I have something which works in almost all cases (I have one glitch in one of my test that I cannot explain so far):

Series.objects.annotate(nb_completed_by_user=Count('puzzles',
filter=Q(puzzles__plays__puzzle_completed=True, 
    puzzles__plays__series_id=F('id'),puzzles__plays__user=user), distinct=True))

For nb_not_completed_by_user, I was able to make a query on Puzzle that gives me the good answer, but I am not able to transform it into a Subquery expression that works without throwing up an error, or to get a Count expression to give me the proper answer.

This one works:

puzzles = Puzzle.objects.filter(~Q(plays__puzzle_completed=True,
 plays__series_id=1, plays__user=user),series=s)

but when trying to move to a subquery, I cannot find the way to use the following expression not to throw the error:ValueError: This queryset contains a reference to an outer query and may only be used in a subquery.

pzl_completed_by_user = Puzzle.objects.filter(plays__series_id=OuterRef('id')).exclude(
    plays__puzzle_completed=True,plays__series_id=OuterRef('id'), plays__user=user)

and the following Count expression doesn't give me the right result:

Series.objects.annotate(nb_not_completed_by_user=Count('puzzles', filter=~Q(
            puzzle__plays__puzzle_completed=True, puzzle__plays__series_id=F('id'), 
            puzzle__plays__user=user))

Could anybody explain me how I could obtain both values ? and eventually to propose me a link which explains clearly how to use subqueries for less-obvious cases than those in the official documentation

Thanks in advance


Edit March 2021: I recently found two posts which guided me through one potential solution to this specific issue: Django Count and Sum annotations interfere with each other and Django 1.11 Annotating a Subquery Aggregate

I implemented the proposed solution from https://stackoverflow.com/users/188/matthew-schinckel and https://stackoverflow.com/users/1164966/benoit-blanchon having help classes: class SubqueryCount(Subquery) and class SubquerySum(Subquery)

class SubqueryCount(Subquery):
    template = "(SELECT count(*) FROM (%(subquery)s) _count)"
    output_field = PositiveIntegerField()


class SubquerySum(Subquery):
    template = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'

    def __init__(self, queryset, column, output_field=None, **extra):
        if output_field is None:
            output_field = queryset.model._meta.get_field(column)
        super().__init__(queryset, output_field, column=column, **extra)

It works extremely well ! and is far quicker than the conventional Django Count annotation. ... at least in SQlite, and probably PostgreSQL as stated by others.

But when I tried in a MariaDB environnement ... it crashed ! MariaDB is apparently not able / not willing to handle correlated subqueries as those are considered sub-optimal.

In my case, as I try to get from the database multiple Count/distinct annotations for each record at the same time, I really see a tremendous gain in performance (in SQLite) that I would like to replicate in MariaDB.

Would anyone be able to help me figure out a way to implement those helper functions for MariaDB ?

What should template be in this environnement?

matthew-schinckel ? benoit-blanchon ? rktavi ?


Solution

  • Going a bit deeper and analysis the Django docs a bit more in details, I was finally able to produce a satisfying way to produce a Count or Sum based on subquery.

    For simplifying the process, I defined the following helper functions:

    To generate the subquery:

    def get_subquery(app_label, model_name, reference_to_model_object, filter_parameters={}):
        """
        Return a subquery from a given model (work with both FK & M2M)
        can add extra filter parameters as dictionary:
    
        Use:
            subquery = get_subquery(
                        app_label='puzzles', model_name='Puzzle',
                        reference_to_model_object='puzzle_family__target'
                        )
            or directly:
            qs.annotate(nb_puzzles=subquery_count(get_subquery(
                'puzzles', 'Puzzle','puzzle_family__target')),)
        """
        model = apps.get_model(app_label, model_name)
    
        # we need to declare a local dictionary to prevent the external dictionary to be changed by the update method:
        parameters = {f'{reference_to_model_object}__id': OuterRef('id')}
        parameters.update(filter_parameters)
        # putting '__id' instead of '_id' to work with both FK & M2M
        return model.objects.filter(**parameters).order_by().values(f'{reference_to_model_object}__id')
    

    To count the subquery generated through get_subquery:

    def subquery_count(subquery):
        """  
        Use:
            qs.annotate(nb_puzzles=subquery_count(get_subquery(
                'puzzles', 'Puzzle','puzzle_family__target')),)
        """
        return Coalesce(Subquery(subquery.annotate(count=Count('pk', distinct=True)).order_by().values('count'), output_field=PositiveIntegerField()), 0)
    

    To sum the subquery generated through get_subquery on the field field_to_sum:

    def subquery_sum(subquery, field_to_sum, output_field=None):
        """  
        Use:
            qs.annotate(total_points=subquery_sum(get_subquery(
                'puzzles', 'Puzzle','puzzle_family__target'),'points'),)
        """
        if output_field is None:
            output_field = queryset.model._meta.get_field(column)
    
        return Coalesce(Subquery(subquery.annotate(result=Sum(field_to_sum, output_field=output_field)).order_by().values('result'), output_field=output_field), 0)
    

    The required imports:

    from django.db.models import Count, Subquery, PositiveIntegerField, DecimalField, Sum
    from django.db.models.functions import Coalesce
    

    I spent so many hours on solving this ... I hope that this will save many of you all the frustration I went through figuring out the right way to proceed.