Tags: postgresql, group-by

Grouping rows by day in unix time


I have a table with the following DDL:

CREATE TABLE public.test_table (
    id serial4 NOT NULL,
    "date" int8 NOT NULL,
    user_id int4 NOT NULL,
    device_id int4 NULL,
    CONSTRAINT test_table_date_user_id_device_id_key UNIQUE (date, user_id, device_id),
    CONSTRAINT test_table_pkey PRIMARY KEY (id)
);

ALTER TABLE public.test_table ADD CONSTRAINT test_table FOREIGN KEY (device_id) REFERENCES public.t_device(id);
ALTER TABLE public.test_table ADD CONSTRAINT test_table_user_id_fkey FOREIGN KEY (user_id) REFERENCES public.t_user(id) ON DELETE CASCADE;

Sample:

| id       | date           | user_id |
| -------- | -------------- | ------- |
| 1886     | 1716625890     | 5       |
| 1887     | 1716626430     | 5       |
| 1888     | 1716627030     | 5       |

I need to find the row count grouped by day, and also print the first and last dates within each day. As input I have two dates (from and to) and a user_id.

I'm trying to use the following SQL:

with calendar as (
    select d
    from generate_series(
        to_timestamp(:FROM)::date,
        to_timestamp(:TO)::date,
        interval '1 day') d
)
select
    c.d::date as item_date,
    count(dt.id) as item_count,
    min(dt."date") as item_min,
    max(dt."date") as item_max
from
    test_table dt
left join calendar c
    on to_timestamp(dt.date)::date >= c.d
    and to_timestamp(dt.date)::date < c.d + interval '1 day'
where dt.user_id = 5
group by
    c.d
order by
    c.d;

My current output:

|item_date |item_count|item_min     |item_max     |
|----------|----------|-------------|-------------|
|2024-05-25|144       |1,716,584,490|1,716,670,230|
|2024-05-26|144       |1,716,670,816|1,716,756,630|
|2024-05-27|144       |1,716,757,221|1,716,843,030|
|          |4,770     |1,716,286,230|1,719,514,890|

explain(analyze, verbose, buffers, settings) result:

Sort  (cost=199067.86..199068.36 rows=200 width=36) (actual time=10.152..10.154 rows=3 loops=1)
  Output: ((d.d)::date), (count(dt.id)), (min(dt.date)), (max(dt.date)), d.d
  Sort Key: d.d
  Sort Method: quicksort  Memory: 25kB
  Buffers: shared hit=363
  ->  HashAggregate  (cost=199057.72..199060.22 rows=200 width=36) (actual time=10.144..10.146 rows=3 loops=1)
        Output: (d.d)::date, count(dt.id), min(dt.date), max(dt.date), d.d
        Group Key: d.d
        Buffers: shared hit=363
        ->  Nested Loop  (cost=0.30..192966.61 rows=609111 width=20) (actual time=0.292..10.074 rows=432 loops=1)
              Output: d.d, dt.id, dt.date
              Join Filter: (((to_timestamp((dt.date)::double precision))::date >= d.d) AND ((to_timestamp((dt.date)::double precision))::date < (d.d + '1 day'::interval)))
              Rows Removed by Join Filter: 15174
              Buffers: shared hit=363
              ->  Function Scan on pg_catalog.generate_series d  (cost=0.01..10.01 rows=1000 width=8) (actual time=0.006..0.008 rows=3 loops=1)
                    Output: d.d
                    Function Call: generate_series((('2024-05-25 11:31:30+03'::timestamp with time zone)::date)::timestamp with time zone, (('2024-05-27 14:50:30+03'::timestamp with time zone)::date)::timestamp with time zone, '1 day'::interval)
              ->  Materialize  (cost=0.29..1100.30 rows=5482 width=12) (actual time=0.011..1.198 rows=5202 loops=3)
                    Output: dt.id, dt.date
                    Buffers: shared hit=363
                    ->  Index Scan using test_table_date_user_id_device_id_key on public.test_table dt  (cost=0.29..1072.89 rows=5482 width=12) (actual time=0.032..2.070 rows=5202 loops=1)
                          Output: dt.id, dt.date
                          Index Cond: (dt.user_id = 5)
                          Buffers: shared hit=363
Settings: effective_cache_size = '1377MB', effective_io_concurrency = '200', max_parallel_workers = '1', random_page_cost = '1.1', search_path = 'public, public, "$user"', work_mem = '524kB'
Planning Time: 0.147 ms
Execution Time: 10.264 ms
  • Is the query optimal?
  • Is it possible to remove the last total row in the result? (Updated: fixed, my mistake, thank you Frank Heikens)

Solution

  • Is the query optimal?

    It's not. The CTE is unnecessary - you're already turning your unix timestamp into a date with to_timestamp(dt.date)::date, so there's no reason to generate a mismatched set of dates with generate_series().

  • Is it possible to remove the last total row in the result?

    It's there because 4,770 rows in test_table don't match any of the dates you generated in the CTE, so they all collapse into a single null group. If you remove the CTE, you'll get rid of that last line and shorten, simplify and speed up your query:
    demo at db<>fiddle

    select
        to_timestamp(dt.date)::date as item_date,
        count(dt.id) as item_count,
        min(dt."date") as item_min,
        max(dt."date") as item_max
    from
        test_table dt 
    where dt.user_id = 5
    --and to_timestamp(dt.date)::date >= to_timestamp(:FROM)::date
    --and to_timestamp(dt.date)::date <  to_timestamp(:TO)::date
    group by
        item_date
    order by
        item_date;
    
    | item_date  | item_count | item_min   | item_max   |
    | ---------- | ---------- | ---------- | ---------- |
    | 2024-05-24 | 144        | 1716509706 | 1716591459 |
    | 2024-05-25 | 144        | 1716595589 | 1716677784 |
    | 2024-05-26 | 144        | 1716682103 | 1716764034 |

    If generate_series() was there to narrow the results down to a target period (the unmatched totals row suggests the data spans only about five weeks), you can uncomment the two added conditions above.
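
    A side note on those range conditions: with to_timestamp(dt.date)::date on the column side of the comparison, the predicate isn't sargable, so your unique index on (date, user_id, device_id), which leads with date, can't serve the range. A minimal sketch, assuming the :FROM/:TO epoch inputs don't need to be snapped to whole-day boundaries, that keeps the column bare:

    select
        to_timestamp(dt.date)::date as item_date,
        count(dt.id) as item_count,
        min(dt."date") as item_min,
        max(dt."date") as item_max
    from
        test_table dt
    where dt.user_id = 5
    -- comparing the bare bigint column keeps the predicate index-friendly;
    -- :FROM and :TO are the same epoch parameters as in the original query
    and dt."date" >= :FROM
    and dt."date" < :TO
    group by
        item_date
    order by
        item_date;

    If you do need whole-day bounds, precompute the snapped epoch on the parameter side, e.g. extract(epoch from to_timestamp(:FROM)::date::timestamptz), rather than wrapping the column.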

    Both bigint and timestamp(tz) are 8 bytes, so if you have control over the schema, changing the column type won't cost any additional space, and it'll make the column easier to work with, enabling all built-in date/time functions and operators without running it through to_timestamp() every time.
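
    For example, a sketch of that migration, assuming you can afford the table rewrite it triggers:

    alter table public.test_table
        -- rewrites the table and its indexes; to_timestamp() maps the stored
        -- epoch seconds onto timestamptz values
        alter column "date" type timestamptz
        using to_timestamp("date");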

    Nitpicking: dt.id is likely a primary key, implying not null, but it's good to make sure you're intentionally using count(dt.id) rather than count(*), i.e. deliberately skipping rows where dt.id is null:

    count ( * ) → bigint
    Computes the number of input rows.

    count ( "any" ) → bigint
    Computes the number of input rows in which the input value is not null.

    When you use a full or right join, the columns of the left table will hold nulls for every row without a match on the right, even if those columns aren't nullable in the table definition.
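
    A quick illustration, reusing the sample table with the calendar flipped to the left side of the join (days without data then produce nulls on the test_table side):

    select
        c.d::date as item_date,
        count(*) as all_rows,        -- counts the calendar row even when nothing matched
        count(dt.id) as matched_rows -- rows with a null dt.id are skipped, giving 0
    from generate_series(timestamptz '2024-05-25',
                         timestamptz '2024-06-05',
                         interval '1 day') as c(d)
    left join test_table dt
        on to_timestamp(dt.date)::date = c.d::date
        and dt.user_id = 5
    group by c.d
    order by c.d;

    For a day with no matching rows, all_rows reports 1 while matched_rows correctly reports 0.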