Search code examples
sqlpostgresqlaggregate-functionswindow-functionsgenerate-series

How to create an efficient query which will count of the records by a specific time interval?


Which database I use?

I use a PostgreSQL 9.5.

What I need?

This's a part of my data_store tables:

  id |          starttime
-----+----------------------------
 185 | 2011-09-12 15:24:03.248+02
 189 | 2011-09-12 15:24:03.256+02    
 312 | 2011-09-12 15:24:06.112+02
 313 | 2011-09-12 15:24:06.119+02
 450 | 2011-09-12 15:24:09.196+02
 451 | 2011-09-12 15:24:09.203+02
 452 | 2011-09-12 15:24:09.21+02
 ... |            ...

I would like to create the query which will count of the records by the specific time interval. For example, for the 4 seconds time interval - the query should return to me something like this:

    starttime-from   |    starttime-to     |  count
---------------------+---------------------+---------
 2011-09-12 15:24:03 | 2011-09-12 15:24:07 |    4
 2011-09-12 15:24:07 | 2011-09-12 15:24:11 |    3
 2011-09-12 15:24:11 | 2011-09-12 15:24:15 |    0
         ...         |         ...         |   ...

The most important things:

  1. The time interval depends on the user's choice. It could be 1 second, 37 seconds, 50 minutes or some mix: 2 month and 30 mintues. The available units for the time interval: millisecond, second, minute, hour, day, month, year. How you see, I need some generic/universal query for that BUT I could also create several query for each unit - it isn't a problem.
  2. The query should be efficient, because I work in a large database (20 million rows and more but in query I use only a part of this database, for example: 1 million).

The question is: How should the query look like to achieve that?

I tried to convert the solutions which I found in the following threads, but I didn't succeed:

What I have?

I deleted this section of my post for greater transparency of the post. This section wasn't necessary to give an answer my question. If you want to see what here was, look at the history of the post.


Solution

  • Your query seems complicated. You only need to generate the sequence of times and then use left join to bring them together . . . and aggregate:

    select g.ts,  g.ts + interval '4 second', count(ds.id)
    from (select generate_series(min(starttime), max(strttime), interval '4 second') as ts
          from data_store
         ) g left join
         data_store ds
         on ds.starttime >= g.ts and ds.starttime < g.ts + interval '4 second'
    group by g.ts
    order by g.ts;
    

    Note: If you want the interval to begin on an exact second (and not have some strange number of milliseconds 999 times out of 1000), then use date_trunc().

    EDIT:

    It might be worth seeing if a correlated subquery is faster:

    select gs.ts,
           (select count(*)
            from data_store ds
            where ds.starttime >= g.ts and ds.starttime < g.ts + interval '4 second'
           ) as cnt
    from (select generate_series(min(starttime), max(strttime), interval '4 second') as ts
          from data_store
         ) g;