python django django-1.11 aggregates django-mysql

Django - Time aggregates of DatetimeField across queryset


(Using Django 1.11.2, Python 2.7.10, MySQL 5.7.18.)

If we imagine a simple model:

from django.db import models

class Event(models.Model):
    happened_datetime = models.DateTimeField()
    value = models.IntegerField()

What would be the most elegant (and quickest) way to run something similar to:

res = Event.objects.all().aggregate(
    Avg('happened_datetime')
)

but one that extracts the average time of day (ignoring the date) across all members of the queryset. Something like:

res = Event.objects.all().aggregate(
    AvgTimeOfDay('happened_datetime')
)

Would it be possible to do this on the db directly, i.e. without running a long loop client-side over each queryset member?

EDIT:

There may be a solution, along those lines, using raw SQL:

select sec_to_time(avg(time_to_sec(extract(HOUR_SECOND from happened_datetime)))) from event_event;

Performance-wise, this runs in 0.015 seconds for ~23k rows on a laptop, with no optimisation. Assuming it yields accurate results, and since speed is only a secondary factor, is this a reasonable approach to use?


Solution

  • After a bit of searching and experimenting, the query below seems to work. Any comments on how to improve it (or hints as to why it is completely wrong) are welcome! :-)

    res = Event.objects.raw('''
    SELECT id, sec_to_time(avg(time_to_sec(extract(HOUR_SECOND from happened_datetime)))) AS average_time_of_day
    FROM event_event
    WHERE happened_datetime BETWEEN %s AND %s;''', [start_datetime, end_datetime])
    
    print res[0].__dict__
    # {'average_time_of_day': datetime.time(18, 48, 10, 247700), '_state': <django.db.models.base.ModelState object at 0x0445B370>, 'id': 9397L}
    

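    Note that the ID returned is that of the last object falling in the datetime range of the WHERE clause. The id column is only in the SELECT list because Django otherwise raises "InvalidQuery: Raw query must include the primary key"; since the query aggregates over all matching rows, the id value itself carries no meaning.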
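Since the id column is only there to satisfy raw(), one possible alternative is to bypass the model layer and run the same aggregate through a plain database cursor. The following is a minimal sketch, not a tested drop-in: the SQL is the query from above without the id column, while the names AVG_TIME_SQL and fetch_average_time_of_day are made up for illustration. It assumes a DB-API cursor such as the one returned by django.db.connection.cursor().

```python
# Table and column names come from the query above; the constant name, the
# function name and the cursor-passing design are illustrative assumptions.
AVG_TIME_SQL = """
    SELECT sec_to_time(avg(time_to_sec(extract(HOUR_SECOND from happened_datetime))))
    FROM event_event
    WHERE happened_datetime BETWEEN %s AND %s
"""


def fetch_average_time_of_day(cursor, start_datetime, end_datetime):
    """Run the aggregate on the database and return the single value.

    `cursor` is any DB-API cursor, e.g. one obtained from
    `django.db.connection.cursor()`.
    """
    cursor.execute(AVG_TIME_SQL, [start_datetime, end_datetime])
    row = cursor.fetchone()
    # An aggregate without GROUP BY yields a single row; its value is None
    # when no rows match the WHERE clause.
    return row[0] if row is not None else None


# Hypothetical usage inside a Django project:
#
#     from django.db import connection
#     with connection.cursor() as cursor:
#         avg_time = fetch_average_time_of_day(cursor, start_datetime, end_datetime)
```

Passing the cursor in keeps the function independent of Django itself, so no primary key needs to be selected and no model instance is built around a value that is not really a row.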

    Quick explanation of the SQL series of function calls:

    1. Extract HH:MM:SS from each happened_datetime value via extract(HOUR_SECOND ...).
    2. Convert the time values to seconds via time_to_sec.
    3. Average all the seconds values via avg.
    4. Convert the averaged seconds value back into time format (HH:MM:SS) via sec_to_time.
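For checking the database result against a small sample, the four steps can be mirrored in plain Python. This is only a client-side sketch of the same arithmetic, not a replacement for the SQL; the function names are borrowed from the MySQL functions for readability and the sample datetimes are invented.

```python
from datetime import datetime, timedelta


def time_to_sec(dt):
    """Steps 1 and 2: seconds since midnight, ignoring date and microseconds."""
    return dt.hour * 3600 + dt.minute * 60 + dt.second


def sec_to_time(seconds):
    """Step 4: convert seconds since midnight back to a datetime.time."""
    return (datetime.min + timedelta(seconds=seconds)).time()


def average_time_of_day(datetimes):
    """Step 3: arithmetic mean of the seconds values (list must be non-empty)."""
    seconds = [time_to_sec(dt) for dt in datetimes]
    return sec_to_time(sum(seconds) / float(len(seconds)))


# Invented sample data: 10:00:00 and 12:00:01 on two different days.
sample = [
    datetime(2017, 6, 1, 10, 0, 0),
    datetime(2017, 6, 2, 12, 0, 1),
]
result = average_time_of_day(sample)
```

Here result is 11:00:00.5, with the half second showing up as microseconds. Note that this is a plain arithmetic mean, in the SQL as well as in the sketch, so events at 23:00 and 01:00 average to 12:00 rather than to midnight.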

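    As for the microseconds: they are most likely not something Django adds. avg returns a fractional number of seconds and sec_to_time keeps that fraction, so the .247700 would simply be the fractional part of the average. Not really relevant here either way.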

    Performance note: this seems to be very fast (about 0.015 seconds for ~23k rows, as mentioned in the question), but I haven't benchmarked it properly. Any insight would be appreciated :)