I want to group my model objects by three fields, and delete all objects but the youngest for each group.
My model:
class DataFile(models.Model):
filename = models.CharField(unique=True, max_length=256)
timestamp = models.DateTimeField()
tile_label = models.CharField(max_length=256, blank=True)
region = models.CharField(max_length=256, blank=True)
resolution = models.CharField(max_length=256, blank=True)
This query gives me the list of the values I want to keep:
DataFile.objects.values('resolution', 'region', 'tile_label').annotate(maxfield=Max('timestamp'))
The result of this is
{'resolution': '30', 'region': 'SRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 9, 11, 7, 13, tzinfo=<UTC>)}
{'resolution': '30', 'region': 'NRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 8, 16, 2, 8, tzinfo=<UTC>)}
{'resolution': '30', 'region': 'NRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 15, 23, 5, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'SRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 11, 1, 13, 46, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'NRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 11, 5, 12, 20, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'NRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 18, 5, 54, tzinfo=<UTC>)}
{'resolution': '30', 'region': 'SRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 5, 21, 8, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'SRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 18, 8, 29, tzinfo=<UTC>)}
I now need to filter all DataFile
objects except the one with the highest date in each group of resolution
, region
, tile_label
, and then delete them (i.e. keep the one with the highest date).
How do I do this? I need to do a GROUP BY with values()
in order to get the maximum of each group, but this means that I'm not operating on the whole queryset any more.
I think there's a solution involving subqueries, but I never really got my head around how they work.
I need a database solution, so looping is not an option. I'd also prefer to avoid __in
queries for performance reasons (but would ultimately do it if there's no other solution).
This query gives me the list of the values I want to keep:
DataFile.objects.values('resolution', 'region', 'tile_label').annotate(maxfield=Max('timestamp'))
This query gives you the maximum timestamp, but it doesn't tell you which row it belongs to. You could get that in a second query, but that seems complicated to me.
I tried to think about which property a DataFile would need to have to qualify for deletion. The answer I found was that there must be a newer DataFile with the same resolution
/region
/tile_label
. When I translate that to Django I get something like this:
newer = DataFile.objects.filter(
resolution=OuterRef('resolution'),
region=OuterRef('region'),
tile_label=OuterRef('tile_label'),
timestamp__gt=OuterRef('timestamp'),
)
DataFile.objects.filter(Exists(newer)).delete()
The Exists
creates a Subquery and OuterRef
can be used to reference fields from the outer query.
In SQL this would look someting like this (not tested):
DELETE from datafiles WHERE EXISTS (
SELECT id from datafiles as datafiles2 WHERE (
datafiles2.resolution = datafiles.resolution
AND datafiles2.region = datafiles.region
AND datafiles2.tile_label = datafiles.tile_label
AND datafiles2.timestamp >= datafiles.timestamp
)
)