Search code examples
pythondjangodjango-ormdjango-2.2

Django prefetch_related and performance optimisation with multiple QuerySets within loops


Effectively I have multiple Queries within loops that I am just not happy with. I am seeking some expertise with prefetch_related and other Django Query construction optimisations.

I start with:

users = User.objects.filter(organisation=organisation).filter(is_active=True)

Then, I start my loop over all days starting from a certain date "start_date":

for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):

I then within this loop over a filtered subset of the above users

for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
  for user in users.filter(created_date__lte=date).iterator():

Firstly, is there any way to optimise this?

What may make some of the hardcore Django-ers lose their tether, I do all of the above inside another loop!

for survey in Survey.objects.all().iterator():
   for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
      for user in users.filter(created_date__lte=date).iterator():

Inside the last loop, I perform one final Query filter:

survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]

I do this because I feel I need to have the user, survey and date variables ready to filter...

I have started thinking about prefetch_related and the Prefetch object. I've consulted the documentation but I can't seem to apply this to my situation.

Effectively, the query is taking far too long. For an average of 1000 users, 4 surveys and approximately 30 days, this query is taking 1 minute to complete.

Ideally, I would like to shave off 50% of this. Any better, and I will be extremely happy. I'd also like the load on the DB server to be reduced as this query could be running multiple times across different organisations.

I'd also like to improve how to organise such horrific queries within loops within loops!

Full "condensed" minimum viable snippet:


users = User.objects.filter(organisation=organisation).filter(is_active=True)

datasets = []

for survey in Survey.objects.all():
    data = []
    for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
        total_score = 0
        participants = 0

        for user in users.filter(created_date__lte=date):
             participants += 1
             survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]
             total_score += survey_result.score
      
        # An average is calculated from the total_score and participants and append to the data array.:
        # Divide catches divide by zero errors.
        # Round will round to two decimal places for front end readability.
        data.append(
            round(
                divide(total_score, participants), 2
            )
        )

    datasets.append(data)
    

Addendum

So, further to @dirkgroten's answer I am currently running with:

for survey in Survey.objects.all():

                results = SurveyResult.objects.filter(
                    user__in=users, survey=survey, created_date__range=date_range
                ).values(
                    'survey',
                    'created_date',
                ).annotate(
                    total_score=Sum('normalized_score'),
                    participants=Count('user'),
                    average_score=Avg('normalized_score'),
                ).order_by(
                    'created_date'
                )

                for result in results:
                    print(result)

As I think I need a breakdown by survey for each QuerySet.

Are there any other optimisations available to me?


Solution

  • You can actually combine queries and perform the calculations directly inside your query:

    from django.db.models import Sum, Count, Avg
    from django.utils import timezone
    users = User.objects.filter(organisation=organisation).filter(is_active=True)
    date_range = [start_date, timezone.now().date]  # or adapt end time to different time zone
    results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
                  .values('survey', 'created_date')\
                  .annotate(total_score=Sum('score'), participants=Count('pk'))
                  .order_by('survey', 'created_date')
    

    This will group the results by survey and created_date and add the total_score and participants to each result, something like:

    [{'survey': 1, 'created_date': '2019-08-05', 'total_score': 54, 'participants': 20}, 
     {'survey': 1, ... } ... ]
    

    I'm assuming there's only one SurveyResult per user so the number of SurveyResult in each group is the number of participants.

    Note that Avg also gives you the average score at once, that is assuming only one possible score per user:

    .annotate(average_score=Avg('score'))  # instead of total and participants
    

    This should shave off 99.9% of your query time :-)

    If you want the dataset as a list of lists, you just do something like this:

    dataset = []
    data = []
    current_survey = None
    current_date = start_date
    for result in results
        if not result['survey'] == current_survey:
            # results ordered by survey, so if it changes, reset data
            if data: dataset.append(data)            
            data = []
            current_survey = result['survey']
        if not result['created_date'] == current_date:
            # results ordered by date so missing date won't be there later 
            # assume a daterange function to create a list of dates
            for date in daterange(current_date, result['created_date']):
                data.append(0)  # padding data
         current_date = result['created_date']
         data.append(result['average_score'])
    

    The result will be a list of lists:

    dataset = [[0, 0, 10.4, 3.9, 0], [20.2, 3.5, ...], ...]
    

    Not super efficient python, but with a few 1000 results, this will be super fast anyway, way faster than performing more db queries.

    EDIT: Since created_date is DateTimeField, you first need to get the corresponding date:

    from django.db.models.functions import TruncDate
    results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)
                   .annotate(date=TruncDate('created_date'))
                   .values('survey', 'date')
                   .annotate(average_score=Avg('score'))
                   .order_by('survey', 'date')