Effectively, I have multiple queries within loops that I am just not happy with, and I am seeking some expertise with prefetch_related and other Django query-construction optimisations.
I start with:
users = User.objects.filter(organisation=organisation).filter(is_active=True)
Then, I start my loop over all days starting from a certain date "start_date":
for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
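Since this day-by-day generator expression is repeated in each of the loops below, it may be easier to follow pulled out into a small helper (a sketch; `days_between` is a name I'm introducing, not something from the original code):

```python
import datetime


def days_between(start_date, end_date):
    """Yield each day from start_date up to and including end_date."""
    for n in range((end_date - start_date).days + 1):
        yield start_date + datetime.timedelta(days=n)


# The loops below would then read:
# for date in days_between(start_date, datetime.datetime.now().replace(tzinfo=pytz.UTC)):
#     ...
```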
I then, within this loop, loop over a filtered subset of the above users:
for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
    for user in users.filter(created_date__lte=date).iterator():
Firstly, is there any way to optimise this?
What may make some of the hardcore Django-ers lose their tether, I do all of the above inside another loop!
for survey in Survey.objects.all().iterator():
    for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
        for user in users.filter(created_date__lte=date).iterator():
Inside the last loop, I perform one final Query filter:
survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]
I do this because I feel I need to have the user, survey and date variables ready to filter...
I have started thinking about prefetch_related and the Prefetch object. I've consulted the documentation but I can't seem to apply this to my situation.
Effectively, the query is taking far too long. For an average of 1000 users, 4 surveys and approximately 30 days, this query is taking 1 minute to complete.
Ideally, I would like to shave off 50% of this. Any better, and I will be extremely happy. I'd also like the load on the DB server to be reduced as this query could be running multiple times across different organisations.
I'd also like to improve how to organise such horrific queries within loops within loops!
Full "condensed" minimum viable snippet:
users = User.objects.filter(organisation=organisation).filter(is_active=True)
datasets = []
for survey in Survey.objects.all():
    data = []
    for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
        total_score = 0
        participants = 0
        for user in users.filter(created_date__lte=date):
            participants += 1
            survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]
            total_score += survey_result.score
        # An average is calculated from total_score and participants and appended to data.
        # divide() catches divide-by-zero errors.
        # round() rounds to two decimal places for front-end readability.
        data.append(
            round(
                divide(total_score, participants), 2
            )
        )
    datasets.append(data)
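The `divide` helper referenced in the comments isn't shown in the snippet; a minimal sketch of what it presumably does (its real implementation may differ) is:

```python
def divide(numerator, denominator):
    """Return numerator / denominator, or 0 when the denominator is zero.

    Sketch of the `divide` helper referenced in the snippet above;
    the actual implementation is not shown in the question.
    """
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return 0
```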
So, further to @dirkgroten's answer, I am currently running with:
for survey in Survey.objects.all():
    results = SurveyResult.objects.filter(
        user__in=users, survey=survey, created_date__range=date_range
    ).values(
        'survey',
        'created_date',
    ).annotate(
        total_score=Sum('normalized_score'),
        participants=Count('user'),
        average_score=Avg('normalized_score'),
    ).order_by(
        'created_date'
    )
    for result in results:
        print(result)
As I think I need a breakdown by survey for each QuerySet.
Are there any other optimisations available to me?
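One way to drop the remaining per-survey query loop: since `.values(...).annotate(...)` returns plain dicts, a single query ordered by survey can be split into per-survey groups in Python. A sketch using `itertools.groupby`, with hypothetical rows shaped like that output (the numbers are made up for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows shaped like the .values()/.annotate() output.
rows = [
    {'survey': 1, 'created_date': '2019-08-05', 'average_score': 3.2},
    {'survey': 1, 'created_date': '2019-08-06', 'average_score': 4.1},
    {'survey': 2, 'created_date': '2019-08-05', 'average_score': 2.7},
]

# Rows must already be sorted by survey (order_by('survey', ...) ensures
# this), because groupby only groups consecutive items.
by_survey = {
    survey: [r['average_score'] for r in group]
    for survey, group in groupby(rows, key=itemgetter('survey'))
}
# by_survey == {1: [3.2, 4.1], 2: [2.7]}
```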
You can actually combine queries and perform the calculations directly inside your query:
from django.db.models import Sum, Count, Avg
from django.utils import timezone
users = User.objects.filter(organisation=organisation).filter(is_active=True)
date_range = [start_date, timezone.now().date()]  # or adapt the end time to a different time zone
results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
    .values('survey', 'created_date')\
    .annotate(total_score=Sum('score'), participants=Count('pk'))\
    .order_by('survey', 'created_date')
This will group the results by survey and created_date, and add the total_score and participants to each result, something like:
[{'survey': 1, 'created_date': '2019-08-05', 'total_score': 54, 'participants': 20},
{'survey': 1, ... } ... ]
I'm assuming there's only one SurveyResult per user, so the number of SurveyResults in each group is the number of participants. Note that Avg also gives you the average score at once, again assuming only one possible score per user:
.annotate(average_score=Avg('score')) # instead of total and participants
This should shave off 99.9% of your query time :-)
If you want the dataset as a list of lists, you just do something like this:
dataset = []
data = []
current_survey = None
current_date = start_date
for result in results:
    if result['survey'] != current_survey:
        # results are ordered by survey, so when it changes, start a new data list
        if data:
            dataset.append(data)
        data = []
        current_survey = result['survey']
        current_date = start_date  # dates restart for each survey's group
    if result['created_date'] != current_date:
        # results are ordered by date, so a missing date won't appear later
        # assume a daterange function to create a list of dates
        for date in daterange(current_date, result['created_date']):
            data.append(0)  # padding data
        current_date = result['created_date']
    data.append(result['average_score'])
if data:
    dataset.append(data)  # don't forget the last survey's data
The result will be a list of lists:
dataset = [[0, 0, 10.4, 3.9, 0], [20.2, 3.5, ...], ...]
Not super-efficient Python, but with a few thousand results this will be fast anyway, far faster than performing more db queries.
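The reshaping code above assumes a `daterange` helper. A minimal sketch (my own, not from Django or the question) that matches how the padding loop uses it, i.e. exclusive of the end date, since the end date itself gets a real score appended:

```python
import datetime


def daterange(start, end):
    """Yield each date from start (inclusive) up to end (exclusive)."""
    current = start
    while current < end:
        yield current
        current += datetime.timedelta(days=1)
```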
EDIT: Since created_date is a DateTimeField, you first need to get the corresponding date:
from django.db.models.functions import TruncDate
results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
    .annotate(date=TruncDate('created_date'))\
    .values('survey', 'date')\
    .annotate(average_score=Avg('score'))\
    .order_by('survey', 'date')