I have an application that scrapes some eshops every day, and I record the Price
of every Item
.
A simplified version of my models looks like this:
class Item(models.Model):
name = models.CharField(max_length=512, blank=True)
class Price(models.Model):
item = models.ForeignKey("Item", on_delete=models.CASCADE)
timestamp = models.DateTimeField(blank=True)
per_item = models.FloatField(blank=True)
And my tables can look like this:
# Item
id | name
-----------
1i | item-1
2i | item-2
3i | item-3
# Price
id | item | timestamp | per_item
--------------------------------------
1p | i1 | 2023-04-10 | 10.0
2p | i2 | 2023-04-10 | 2.0
3p | i3 | 2023-04-10 | 14.5
4p | i1 | 2023-04-11 | 11.0
5p | i3 | 2023-04-11 | 12.5
6p | i2 | 2023-04-12 | 2.0
7p | i3 | 2023-04-12 | 14.5
8p | i1 | 2023-04-13 | 12.0
9p | i2 | 2023-04-13 | 3.5
10p | i3 | 2023-04-13 | 15.0
Note that not all Items exist every day, some of them may be present today, missing tomorrow and back again on the day after, and so on.
I'm trying to write a query that for a given day it will give me the last 2 Prices for every Item that existed that day.
So, if I was looking at the last day (2023-04-13) I would get something like this:
[
{'id': '1i', 'latest_prices': [{'price__id': '8p', 'price__per_item': 12.0}, {'price__id': '4p', 'price__per_item': 11.0}]
{'id': '2i', 'latest_prices': [{'price__id': '9p', 'price__per_item': 3.5}, {'price__id': '6p', 'price__per_item': 2.0}]
{'id': '3i', 'latest_prices': [{'price__id': '10p', 'price__per_item': 15.0}, {'price__id': '7p', 'price__per_item': 14.5}]
]
I have ~25k items every day, so I'd like not to:
Price
to retrieve its per_item
fieldItem
of every day, to retrieve its last 2 Price
sannotate
and Subquery
, but I was getting django.db.utils.ProgrammingError: more than one row returned by a subquery used as an expression
ArraySubquery
but I could only return multiple price__id
, not multiple fieldsArraySubquery
link and I noticed the JSONObject
- however I'm probably holding it the wrong way and I always get a single item in the latest_prices
annotation.I suspect that the solution is in the JSONObject
approach, but somehow it's just out of reach for my mind at the moment, so any steer in the right direction is really appreciated :)
Right, so the answer is indeed a combination of ArraySubquery
and JSONObject
and it looks like this:
subquery = (
Price.objects.filter(item__id=OuterRef("id"))
.filter(timestamp__lt=end_of_today)
.order_by("-retrieval__timestamp")
.values(json=JSONObject(price_id="id", per_item="per_item"))[:2]
)
items_with_prices = (
Item.objects.filter(id__in=item_ids)
.annotate(latest_prices=ArraySubquery(subquery))
.values("id", "latest_prices")
)
The reason it was not working for me is because I had forgotten a debugging statement that restricted how many days back I could look 🤦♂️