Search code examples
pythondjangopostgresqldjango-orm

How can I annotate a django queryset with more than one values


I have an application that scrapes some eshops every day, and I record the Price of every Item.

A simplified version of my models looks like this:

class Item(models.Model):
    name = models.CharField(max_length=512, blank=True)


class Price(models.Model):
    item = models.ForeignKey("Item", on_delete=models.CASCADE)
    timestamp = models.DateTimeField(blank=True)
    per_item = models.FloatField(blank=True)

And my tables can look like this:

# Item

id  | name
-----------
1i  | item-1
2i  | item-2
3i  | item-3

# Price

id  | item   | timestamp   | per_item
--------------------------------------
1p  | i1     | 2023-04-10  | 10.0
2p  | i2     | 2023-04-10  | 2.0
3p  | i3     | 2023-04-10  | 14.5

4p  | i1     | 2023-04-11  | 11.0
5p  | i3     | 2023-04-11  | 12.5

6p  | i2     | 2023-04-12  | 2.0
7p  | i3     | 2023-04-12  | 14.5

8p  | i1     | 2023-04-13  | 12.0
9p  | i2     | 2023-04-13  | 3.5
10p | i3     | 2023-04-13  | 15.0

Note that not all Items exist every day, some of them may be present today, missing tomorrow and back again on the day after, and so on.

What I want to do

I'm trying to write a query that for a given day it will give me the last 2 Prices for every Item that existed that day.

So, if I was looking at the last day (2023-04-13) I would get something like this:

[
{'id': '1i', 'latest_prices': [{'price__id': '8p', 'price__per_item': 12.0}, {'price__id': '4p', 'price__per_item': 11.0}]
{'id': '2i', 'latest_prices': [{'price__id': '9p', 'price__per_item': 3.5}, {'price__id': '6p', 'price__per_item': 2.0}]
{'id': '3i', 'latest_prices': [{'price__id': '10p', 'price__per_item': 15.0}, {'price__id': '7p', 'price__per_item': 14.5}]
]

I have ~25k items every day, so I'd like not to:

  • Have to go to the database for every Price to retrieve its per_item field
  • Have to loop through every Item of every day, to retrieve its last 2 Prices

What I have tried

  1. I have tried using annotate and Subquery, but I was getting django.db.utils.ProgrammingError: more than one row returned by a subquery used as an expression
  2. Then I found that on postgres I could use an ArraySubquery but I could only return multiple price__id, not multiple fields
  3. Then I looked closer at the example in the ArraySubquery link and I noticed the JSONObject - however I'm probably holding it the wrong way and I always get a single item in the latest_prices annotation.

I suspect that the solution is in the JSONObject approach, but somehow it's just out of reach for my mind at the moment, so any steer in the right direction is really appreciated :)


Solution

  • Right, so the answer is indeed a combination of ArraySubquery and JSONObject and it looks like this:

    subquery = (
        Price.objects.filter(item__id=OuterRef("id"))
            .filter(timestamp__lt=end_of_today)
            .order_by("-retrieval__timestamp")
            .values(json=JSONObject(price_id="id", per_item="per_item"))[:2]
        )
    
    items_with_prices = (
        Item.objects.filter(id__in=item_ids)
            .annotate(latest_prices=ArraySubquery(subquery))
            .values("id", "latest_prices")
    )
    

    The reason it was not working for me is because I had forgotten a debugging statement that restricted how many days back I could look 🤦‍♂️