sql postgresql time-series aggregate-functions greatest-n-per-group

Create date-aggregate statistics of product sales and prices in plain SQL

I have the following two tables:

CREATE TABLE products 
(
    id INT,
    created_at DATE,
    sold_at DATE
);

CREATE TABLE product_prices 
(
    id INT,
    product_id INT,
    price numeric,
    created_at DATE
);

The data model logic works as follows:

When a new product is put for sale, a record is inserted into products with the current date as created_at.
At the same time, a record is inserted into product_prices with the same date in created_at, an FK reference to the product and a price.
If a product is sold, the sold_at is set on the products record.
If the product changes price throughout its sales period a new record is added to product_prices with the created_at date the price was changed. This means, that if you wanna know what price a product has at a given date, then you need to check what the price was at that date by looking at product_prices.

Now imagine that I have seed data something like this:

SELECT * FROM products;

id	created_at	sold_at
1	2022-01-25T00:00:00.000Z	2022-02-18T00:00:00.000Z
2	2022-01-26T00:00:00.000Z
3	2022-01-28T00:00:00.000Z	2022-01-30T00:00:00.000Z
4	2022-02-01T00:00:00.000Z	2022-02-01T00:00:00.000Z
5	2022-02-01T00:00:00.000Z	2022-02-15T00:00:00.000Z
6	2022-02-10T00:00:00.000Z	2022-02-13T00:00:00.000Z
7	2022-02-14T00:00:00.000Z
8	2022-02-19T00:00:00.000Z
9	2022-02-20T00:00:00.000Z	2022-02-22T00:00:00.000Z
10	2022-02-22T00:00:00.000Z

and

SELECT * FROM product_prices;

id	product_id	price	created_at
1	1	100.0	2022-01-25T00:00:00.000Z
2	1	95.0	2022-02-02T00:00:00.000Z
3	1	85.0	2022-02-17T00:00:00.000Z
4	2	89.0	2022-01-26T00:00:00.000Z
5	2	85.0	2022-01-30T00:00:00.000Z
6	3	91.0	2022-01-28T00:00:00.000Z
7	4	50.0	2022-02-01T00:00:00.000Z
8	5	100.0	2022-02-01T00:00:00.000Z
9	5	99.0	2022-02-03T00:00:00.000Z
10	6	79.0	2022-02-10T00:00:00.000Z
11	6	75.0	2022-02-11T00:00:00.000Z
12	6	71.0	2022-02-12T00:00:00.000Z
13	7	120.0	2022-02-14T00:00:00.000Z
14	7	110.0	2022-02-16T00:00:00.000Z
15	8	89.0	2022-02-19T00:00:00.000Z
16	9	30.0	2022-02-20T00:00:00.000Z
17	9	29.0	2022-02-22T00:00:00.000Z
18	10	100.0	2022-02-22T00:00:00.000Z

I want to know what was the average price and the number of products for sale and the number of sold products on a daily basis between 2022-01-23 and 2022-02-23.

In pseudo SQL it would be something like:

SELECT 
    COUNT(products_for_sale_this_day), 
    COUNT(products_sold_this_day), 
    AVG(price_of_products_for_sale_on_this_day) 
FROM 
    products ... 
WHERE 
    date "is between 2022-01-23 and 2022-02-23" 
GROUP BY 
    "dates in between"`

The result I would expect from the seed data would be:

Products for sale	Number of sold	Avg price	Date
0	0	0.0	2022-01-23
0	0	0.0	2022-01-24
1	0	xx.xx	2022-01-25
2	0	xx.xx	2022-01-26
2	0	xx.xx	2022-01-27
3	0	xx.xx	2022-01-28
3	0	xx.xx	2022-01-29
3	1	92.0	2022-01-30
2	0	xx.xx	2022-01-31
4	1	xx.xx	2022-02-01
3	0	xx.xx	2022-02-02
3	0	xx.xx	2022-02-03
3	0	xx.xx	2022-02-04
3	0	xx.xx	2022-02-05
3	0	xx.xx	2022-02-06
3	0	xx.xx	2022-02-07
3	0	xx.xx	2022-02-08
3	0	xx.xx	2022-02-09
4	0	xx.xx	2022-02-10
4	0	xx.xx	2022-02-11
4	0	xx.xx	2022-02-12
4	1	xx.xx	2022-02-13
4	0	xx.xx	2022-02-14
4	1	xx.xx	2022-02-15
3	0	xx.xx	2022-02-16
3	0	xx.xx	2022-02-17
3	1	xx.xx	2022-02-18
3	0	xx.xx	2022-02-19
4	0	xx.xx	2022-02-20
4	0	xx.xx	2022-02-21
5	1	xx.xx	2022-02-22
4	0	xx.xx	2022-02-23

NOTE: I added xx.xx as I didn't want to manually calculate the AVG for every day in the example. On the 2022-01-30 the average price comes from the following products being for sale with the following prices:

Product ID 1, price at 2022-01-30: 100.0
Product ID 2, price at 2022-01-30: 85.0
Product ID 3, price at 2022-01-30: 91.0

AVG: (100 + 85 + 91) / 3 = 92

https://www.db-fiddle.com/f/jxQAqLvnKExjTVr16XyonG/0

Solution

SELECT d.the_day                                        AS "Date"
     , count(p.id)                                      AS "Products for sale"
     , count(p.id) FILTER (WHERE p.sold_at = d.the_day) AS "Number of sold"
     , round(coalesce(avg(pp.price), 0), 2)             AS "Avg price"
FROM  (
   SELECT ts::date AS the_day 
   FROM   generate_series (timestamp '2022-01-23'
                         , timestamp '2022-02-23'
                         , interval  '1 day') ts
   ) d
LEFT  JOIN products p ON p.created_at <= d.the_day
                     AND (p.sold_at < d.the_day) IS NOT TRUE
LEFT  JOIN LATERAL (
   SELECT pp.price
   FROM   product_prices pp
   WHERE  pp.product_id = p.id
   AND    pp.created_at <= d.the_day
   ORDER  BY pp.created_at DESC
   LIMIT  1
   ) pp ON true
GROUP  BY 1
ORDER  BY 1;

fiddle

I only join to products that are already listed and not yet sold to begin with. The day of the sale is included. Notably, products sold "today" are counted among "for sale" as well as among "sold", like your expected result demonstrates.

You show type date, but timestamptz literals. I went with dates like in your fiddle. Data types matter ...

Why did I use generate_series() this way?

Generating time series between two dates in PostgreSQL

About the LATERAL join:

What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?

About getting the latest applicable price:

Optimize GROUP BY query to retrieve latest row per user

About the aggregate FILTER clause:

Conditional SQL count