Search code examples
pythondjangodjango-queryset

Aggregating a windowed queryset in Django


Background

Suppose we have a set of questions, and a set of students that answered these questions. The answers have been reviewed, and scores have been assigned, on some unknown range.

Now, we need to normalize the scores with respect to the extreme values within each question. For example, if question 1 has a minimum score of 4 and a maximum score of 12, those scores would be normalized to 0 and 1 respectively. Scores in between are interpolated linearly (as described e.g. in Normalization to bring in the range of [0,1]).

Then, for each student, we would like to know the mean of the normalized scores for all questions combined.

Minimal example

Here's a very naive minimal implementation, just to illustrate what we would like to achieve:

class Question(models.Model):
    pass


class Student(models.Model):
    def mean_normalized_score(self):
        normalized_scores = []
        for score in self.score_set.all():
            normalized_scores.append(score.normalized_value())
        return mean(normalized_scores) if normalized_scores else None


class Score(models.Model):
    student = models.ForeignKey(to=Student, on_delete=models.CASCADE)
    question = models.ForeignKey(to=Question, on_delete=models.CASCADE)
    value = models.FloatField()

    def normalized_value(self):
        limits = Score.objects.filter(question=self.question).aggregate(
            min=models.Min('value'), max=models.Max('value'))
        return (self.value - limits['min']) / (limits['max'] - limits['min'])

This works well, but it is quite inefficient in terms of database queries, etc.

Goal

Instead of the implementation above, I would prefer to offload the number-crunching on to the database.

What I've tried

Consider, for example, these two use cases:

  1. list the normalized_value for all Score objects
  2. list the mean_normalized_score for all Student objects

The first use case can be covered using window functions in a query, something like this:

w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])
annotated_scores = Score.objects.annotate(
    normalized_value=(F('value') - w_min) / (w_max - w_min))

This works nicely, so the Score.normalized_value() method from the example is no longer needed.

Now, I would like to do something similar for the second use case, to replace the Student.mean_normalized_score() method by a single database query.

The raw SQL could look something like this (for sqlite):

SELECT id, student_id, AVG(normalized_value) AS mean_normalized_score
FROM (
    SELECT
        myapp_score.*,
        ((myapp_score.value - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)) / (MAX(myapp_score.value) OVER (PARTITION BY myapp_score.question_id) - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)))
        AS normalized_value
    FROM myapp_score
    ) 
GROUP BY student_id

I can make this work as a raw Django query, but I have not yet been able to reproduce this query using Django's ORM.

I've tried building on the annotated_scores queryset described above, using Django's Subquery, annotate(), aggregate(), Prefetch, and combinations of those, but I must be making a mistake somewhere.

Probably the closest I've gotten is this:

subquery = Subquery(annotated_scores.values('normalized_value'))
Score.objects.values('student_id').annotate(mean=Avg(subquery))

But this is incorrect.

Could someone point me in the right direction, without resorting to raw queries?


Solution

  • I may have found a way to do this using subqueries. The main thing is at least from django, we cannot use the window functions on aggregates, so that's what is blocking the calculation of the mean of the normalized values. I've added comments on the lines to explain what I'm trying to do:

    # Get the minimum score per question
    min_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(min=Min('value'))
    
    # Get the maximum score per question
    max_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(max=Max('value'))
    
    # Calculate the normalized value per score, then get the average by grouping by students
    mean_subquery = Score.objects.filter(student=OuterRef('pk')).annotate(
        min=Subquery(min_subquery.values('min')[:1]), 
        max=Subquery(max_subquery.values('max')[:1]), 
        normalized=ExpressionWrapper((F('value') - F('min'))/(F('max') - F('min')), output_field=FloatField())
    ).values('student').annotate(mean=Avg('normalized'))
    
    # Get the calculated mean per student
    Student.objects.annotate(mean=Subquery(mean_subquery.values('mean')[:1]))
    

    The resulting SQL is:

    SELECT 
      "student"."id", 
      "student"."name", 
      (
        SELECT 
          AVG(
            (
              (
                V0."value" - (
                  SELECT 
                    MIN(U0."value") AS "min" 
                  FROM 
                    "score" U0 
                  WHERE 
                    U0."question_id" = (V0."question_id") 
                  GROUP BY 
                    U0."question_id" 
                  LIMIT 
                    1
                )
              ) / (
                (
                  SELECT 
                    MAX(U0."value") AS "max" 
                  FROM 
                    "score" U0 
                  WHERE 
                    U0."question_id" = (V0."question_id") 
                  GROUP BY 
                    U0."question_id" 
                  LIMIT 
                    1
                ) - (
                  SELECT 
                    MIN(U0."value") AS "min" 
                  FROM 
                    "score" U0 
                  WHERE 
                    U0."question_id" = (V0."question_id") 
                  GROUP BY 
                    U0."question_id" 
                  LIMIT 
                    1
                )
              )
            )
          ) AS "mean" 
        FROM 
          "score" V0 
        WHERE 
          V0."student_id" = ("student"."id") 
        GROUP BY 
          V0."student_id" 
        LIMIT 
          1
      ) AS "mean" 
    FROM 
      "student"