Search code examples
djangogroup-byfilteringdjango-querysetdjango-annotate

How to delete everything but the latest object in each group in Django


I want to group my model objects by three fields, and delete all objects but the youngest for each group.

My model:

class DataFile(models.Model):
    filename = models.CharField(unique=True, max_length=256)
    timestamp = models.DateTimeField()
    tile_label = models.CharField(max_length=256, blank=True)
    region = models.CharField(max_length=256, blank=True)
    resolution = models.CharField(max_length=256, blank=True)

This query gives me the list of the values I want to keep:

DataFile.objects.values('resolution', 'region', 'tile_label').annotate(maxfield=Max('timestamp'))

The result of this is

{'resolution': '30', 'region': 'SRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 9, 11, 7, 13, tzinfo=<UTC>)}
{'resolution': '30', 'region': 'NRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 8, 16, 2, 8, tzinfo=<UTC>)}
{'resolution': '30', 'region': 'NRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 15, 23, 5, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'SRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 11, 1, 13, 46, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'NRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 11, 5, 12, 20, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'NRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 18, 5, 54, tzinfo=<UTC>)}
{'resolution': '30', 'region': 'SRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 5, 21, 8, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'SRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 18, 8, 29, tzinfo=<UTC>)}

I now need to filter all DataFile objects except the one with the highest date in each group of resolution, region, tile_label, and then delete them (i.e. keep the one with the highest date).

How do I do this? I need to do a GROUP BY with values() in order to get the maximum of each group, but this means that I'm not operating on the whole queryset any more.
I think there's a solution involving subqueries, but I never really got my head around how they work.

I need a database solution, so looping is not an option. I'd also prefer to avoid __in queries for performance reasons (but would ultimately do it if there's no other solution).


Solution

  • This query gives me the list of the values I want to keep:

    DataFile.objects.values('resolution', 'region', 'tile_label').annotate(maxfield=Max('timestamp'))
    

    This query gives you the maximum timestamp, but it doesn't tell you which row it belongs to. You could get that in a second query, but that seems complicated to me.

    I tried to think about which property a DataFile would need to have to qualify for deletion. The answer I found was that there must be a newer DataFile with the same resolution/region/tile_label. When I translate that to Django I get something like this:

    newer = DataFile.objects.filter(
         resolution=OuterRef('resolution'),
         region=OuterRef('region'),
         tile_label=OuterRef('tile_label'),
         timestamp__gt=OuterRef('timestamp'),
    )
    
    DataFile.objects.filter(Exists(newer)).delete()
    

    The Exists creates a Subquery and OuterRef can be used to reference fields from the outer query.

    https://docs.djangoproject.com/en/4.1/ref/models/expressions/#filtering-on-a-subquery-or-exists-expressions

    In SQL this would look someting like this (not tested):

    DELETE from datafiles WHERE EXISTS (
         SELECT id from datafiles as datafiles2 WHERE (
             datafiles2.resolution = datafiles.resolution
             AND datafiles2.region = datafiles.region
             AND datafiles2.tile_label = datafiles.tile_label
             AND datafiles2.timestamp >= datafiles.timestamp
         )
    )