Suppose we have a set of questions, and a set of students that answered these questions. The answers have been reviewed, and scores have been assigned, on some unknown range.
Now, we need to normalize the scores with respect to the extreme values within each question. For example, if question 1 has a minimum score of 4 and a maximum score of 12, those scores would be normalized to 0 and 1 respectively. Scores in between are interpolated linearly (as described e.g. in Normalization to bring in the range of [0,1]).
Then, for each student, we would like to know the mean of the normalized scores for all questions combined.
Here's a very naive minimal implementation, just to illustrate what we would like to achieve:
class Question(models.Model):
pass
class Student(models.Model):
def mean_normalized_score(self):
normalized_scores = []
for score in self.score_set.all():
normalized_scores.append(score.normalized_value())
return mean(normalized_scores) if normalized_scores else None
class Score(models.Model):
student = models.ForeignKey(to=Student, on_delete=models.CASCADE)
question = models.ForeignKey(to=Question, on_delete=models.CASCADE)
value = models.FloatField()
def normalized_value(self):
limits = Score.objects.filter(question=self.question).aggregate(
min=models.Min('value'), max=models.Max('value'))
return (self.value - limits['min']) / (limits['max'] - limits['min'])
This works well, but it is quite inefficient in terms of database queries, etc.
Instead of the implementation above, I would prefer to offload the number-crunching on to the database.
Consider, for example, these two use cases:
normalized_value
for all Score
objectsmean_normalized_score
for all Student
objectsThe first use case can be covered using window functions in a query, something like this:
w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])
annotated_scores = Score.objects.annotate(
normalized_value=(F('value') - w_min) / (w_max - w_min))
This works nicely, so the Score.normalized_value()
method from the example is no longer needed.
Now, I would like to do something similar for the second use case, to replace the Student.mean_normalized_score()
method by a single database query.
The raw SQL could look something like this (for sqlite):
SELECT id, student_id, AVG(normalized_value) AS mean_normalized_score
FROM (
SELECT
myapp_score.*,
((myapp_score.value - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)) / (MAX(myapp_score.value) OVER (PARTITION BY myapp_score.question_id) - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)))
AS normalized_value
FROM myapp_score
)
GROUP BY student_id
I can make this work as a raw Django query, but I have not yet been able to reproduce this query using Django's ORM.
I've tried building on the annotated_scores
queryset described above, using Django's Subquery, annotate()
, aggregate()
, Prefetch
, and combinations of those, but I must be making a mistake somewhere.
Probably the closest I've gotten is this:
subquery = Subquery(annotated_scores.values('normalized_value'))
Score.objects.values('student_id').annotate(mean=Avg(subquery))
But this is incorrect.
Could someone point me in the right direction, without resorting to raw queries?
I may have found a way to do this using subqueries. The main thing is at least from django, we cannot use the window functions on aggregates, so that's what is blocking the calculation of the mean of the normalized values. I've added comments on the lines to explain what I'm trying to do:
# Get the minimum score per question
min_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(min=Min('value'))
# Get the maximum score per question
max_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(max=Max('value'))
# Calculate the normalized value per score, then get the average by grouping by students
mean_subquery = Score.objects.filter(student=OuterRef('pk')).annotate(
min=Subquery(min_subquery.values('min')[:1]),
max=Subquery(max_subquery.values('max')[:1]),
normalized=ExpressionWrapper((F('value') - F('min'))/(F('max') - F('min')), output_field=FloatField())
).values('student').annotate(mean=Avg('normalized'))
# Get the calculated mean per student
Student.objects.annotate(mean=Subquery(mean_subquery.values('mean')[:1]))
The resulting SQL is:
SELECT
"student"."id",
"student"."name",
(
SELECT
AVG(
(
(
V0."value" - (
SELECT
MIN(U0."value") AS "min"
FROM
"score" U0
WHERE
U0."question_id" = (V0."question_id")
GROUP BY
U0."question_id"
LIMIT
1
)
) / (
(
SELECT
MAX(U0."value") AS "max"
FROM
"score" U0
WHERE
U0."question_id" = (V0."question_id")
GROUP BY
U0."question_id"
LIMIT
1
) - (
SELECT
MIN(U0."value") AS "min"
FROM
"score" U0
WHERE
U0."question_id" = (V0."question_id")
GROUP BY
U0."question_id"
LIMIT
1
)
)
)
) AS "mean"
FROM
"score" V0
WHERE
V0."student_id" = ("student"."id")
GROUP BY
V0."student_id"
LIMIT
1
) AS "mean"
FROM
"student"