django annotations django-queryset window-functions

Django - Annotating each Event with Sum of participants that overlap with the start DateTime of that Event

I have an Event model with a start DateTime and end DateTime, as well as the number of participants.

For each Event object, I want to get an annotated Sum of ALL the participants in any event that overlaps with the start DateTime. This is so I can make sure there aren't too many participants at any given time.

class Event(models.Model):
    start = models.DateTime()
    end = models.DateTime()
    participants = models.IntegerField()

I've been reading about Window functions, and maybe that would work here, but I can't get it right.

I tried this, but it doesn't work because it's looking to group together events with the SAME start DateTime, not overlapping periods of start and end DateTimes against the original event start DateTime.

starts = Event.objects.annotate(
    participants_sum=Window(
    expression=Sum('participants'),
    partition_by=[F('start'),],
    order_by=ExtractDay('start').asc(),
    ),
).values('participants', 'participants_sum', 'start')

Any recommendations would be appreciated!

Thanks so much to @endre-both with his/her help I was able to solve the bigger problem.

The end result I want the values of every start and end transition in my Events table so that I can determine the periods of time with too many participants. But I was worried that explaining that would be too complex.

Here is what I ended up with

from django.contrib.gis.db import models
from django.db.models import F, Window, Sum
from django.utils import timezone

overlap_filter_start = Q(start__lte=OuterRef('start'), end__gte=OuterRef('start'))
overlap_filter_end = Q(start__lte=OuterRef('end'), end__gte=OuterRef('end'))

subquery_start = Subquery(Event.objects
    .filter(overlap_filter_start)
    .annotate(sum_participants=Window(expression=Sum('participants'),))
    .values('sum_participants')[:1],
    output_field=models.IntegerField()
)

subquery_end = Subquery(Event.objects
    .filter(overlap_filter_end)
    .annotate(sum_participants=Window(expression=Sum('participants'),))
    .values('sum_participants')[:1],
    output_field=models.IntegerField()
)

# Will eventually filter the dates I'm checking over specific date ranges rather than the entire Event table
# but for simplicity, filtering from yesterday to tomorrow
before = timezone.now().date() - timezone.timedelta(days=1)
after = timezone.now().date() + timezone.timedelta(days=1)

events_start = Event.objects.filter(start__date__lte=after, start__date__gte=before).annotate(simultaneous_participants=subquery_start)
events_end = Event.objects.filter(end__date__lte=after, end__date__gte=before).annotate(simultaneous_participants=subquery_end)

# Here I combine the queries for *start* transition moments and *end* transition moments, and rename the DateTime I'm looking at to *moment*, and make sure to only return distinct moments (since two equal moments will have the same number of participants)

events = events_start.annotate(moment=F('start')).values('moment', 'simultaneous_participants').union(
    events_end.annotate(moment=F('end')).values('moment', 'simultaneous_participants')).order_by('moment').distinct()

for event in events:
    print(event)

print(events.count())

Now I can take the resulting relatively small resulting queryset and process in Python to determine where the number of participants goes too high, and when it drops back down to acceptable levels.

There's probably a more efficient way to approach this, but I'm pretty happy with this. Much better than trying to do all the heavy lifting in Python.

The resulting output is something like this:

{'simultaneous_participants': 45, 'moment': datetime.datetime(2019, 3, 23, 7, 0, tzinfo=<UTC>)}
{'simultaneous_participants': 45, 'moment': datetime.datetime(2019, 3, 23, 11, 30, tzinfo=<UTC>)}
{'simultaneous_participants': 40, 'moment': datetime.datetime(2019, 3, 23, 14, 0, tzinfo=<UTC>)}
{'simultaneous_participants': 40, 'moment': datetime.datetime(2019, 3, 23, 15, 0, tzinfo=<UTC>)}
{'simultaneous_participants': 35, 'moment': datetime.datetime(2019, 3, 23, 16, 30, tzinfo=<UTC>)}
{'simultaneous_participants': 85, 'moment': datetime.datetime(2019, 3, 24, 19, 0, tzinfo=<UTC>)}
{'simultaneous_participants': 125, 'moment': datetime.datetime(2019, 3, 25, 12, 0, tzinfo=<UTC>)}
{'simultaneous_participants': 90, 'moment': datetime.datetime(2019, 3, 25, 12, 30, tzinfo=<UTC>)}
{'simultaneous_participants': 135, 'moment': datetime.datetime(2019, 3, 25, 13, 0, tzinfo=<UTC>)}
{'simultaneous_participants': 110, 'moment': datetime.datetime(2019, 3, 25, 18, 0, tzinfo=<UTC>)}
{'simultaneous_participants': 160, 'moment': datetime.datetime(2019, 3, 25, 19, 0, tzinfo=<UTC>)}
{'simultaneous_participants': 160, 'moment': datetime.datetime(2019, 3, 25, 20, 30, tzinfo=<UTC>)}
{'simultaneous_participants': 115, 'moment': datetime.datetime(2019, 3, 25, 22, 0, tzinfo=<UTC>)}
{'simultaneous_participants': 80, 'moment': datetime.datetime(2019, 3, 25, 23, 30, tzinfo=<UTC>)}
14

Solution

To annotate your Events with an aggregate that is filtered on some criteria based on the individual Event, you need individual subqueries per event.

This filter should help finding all events that overlap with a specific time range:

overlap_filter = Q(start__lte=OuterRef('end'), end__gte=OuterRef('start'))

This gets you all events that start before or at the end time and end at or after the start time. The filter will be used in a subquery, and with OuterRef we refer to the fields in the outer query.

Next, the subquery. It's unexpectedly difficult to get an aggregate from a subquery as aggregates are not lazy (= they are executed immediately) and a Subquery needs to be. One workaround is to use Window:

subquery = Subquery(Event.objects
        .filter(overlap_filter)
        .annotate(sum_participants=Window(Sum('participants'),))
        .values('sum_participants')[:1],
    output_field=IntegerField()
)

Finally, the query with the annotated Events:

events = Event.objects.annotate(simultaneous_participants=subquery)

Note that while the presence of the participants in this count overlaps with the Event we are looking at, they do not necessarily overlap with each other – they are all present at some time during the duration of the Event, but not all of them at the same time – some may leave before others arrive. To calculate actual attendance peaks, you'd need to look at smaller time increments (depending on how the start and end times are staggered).