Search code examples
mysqlaggregate-functionsdatestamp

Generating time series reports


I'm trying to work out how to create a solution that will allow me to query a table that has a timestamp, and in return get a time series data. The request consists of start/end date & time, granularity type (minute, hour, day, week, month and year) and granularity value. Having tried to use in a query something like

GROUP BY ROUND(UNIX_TIMESTAMP(created_at) DIV 60)

to get the results per one minute, or DIV 300 for every five minutes is fine. The problem lies further up for calculating months and years' seconds which will be inaccurate. I've stumbled upon the generate_series in PGSQL (MySQL alternative) and am stuck trying to tie them together. How do I calculate a count of rows, for example, for two days, on a 15 minute granularity? It's a complex question that I'll probably have to break down further.

I have already visited #1 and #2, but they are incomplete. To me it seems that rounding will only be allowed to certain level and I'd have to restrict it (i.e .for 2 months period there cannot be hourly breakdown).

EDIT

It gave me the wrong impression - I would not have to calculate monthly figures based on seconds using the query like:

SELECT DATE_FORMAT(MIN(created_at),'%d/%m/%Y %H:%i:%s' as date,
COUNT(*) AS count FROM guests
GROUP BY ROUND(UNIX_TIMESTAMP(created_at) / 300)

It's only going to do grouping based on minimum value. But the question still stands - is the best approach really to go through the time period using granularity value and "slice" the data that way without loosing too much accuracy?

It seems that the only approach is to run sub-queries for a set of data (i.e. for a period of two months, generate 15 minute intervals timestamps, group the data into them and produce an aggregate) without dividing the original timestamp to produce the rounded approximation.


Solution

  • Let's say you have a gigantic table measure with two columns datestamp and temp.

    Let's say you want to see the temperature every six minutes (10x per hour) for the last week. You can do this sort of thing. We'll get to defining trunc in a moment.

      SELECT trunc(datestamp) datestamp, AVG(temp) temp
        FROM measure
       WHERE datestamp >= CURDATE() - INVERVAL 7 DAY
      GROUP BY trunc(datestamp)
      ORDER BY trunc(datestamp)
    

    That works for any reasonable definition of trunc. In this case trunc(t) returns the beginning of the six-minute period in which t occurs. So, trunc('1942-12-07 08:45:17') gives 1942-12-07 08:42:00).

    Here's a query that works for every six minute interval.

      SELECT DATE_FORMAT(datestamp,'%Y-%m-%d %H:00') +
                INTERVAL (MINUTE(datestamp) -
                          MINUTE(datestamp) MOD 6) datestamp, 
             AVG(temp) temp
        FROM measure
       WHERE datestamp >= CURDATE() - INVERVAL 7 DAY
       GROUP BY DATE_FORMAT(datestamp,'%Y-%m-%d %H:00') +
                INTERVAL (MINUTE(datestamp) -
                          MINUTE(datestamp) MOD 6)
      ORDER BY 1
    

    This uses inbuilt date arithmetic rather than unix timestamp arithmetic.

    You can use a stored function to make this easier to read.

    DELIMITER $$
    DROP FUNCTION IF EXISTS TRUNC_N_MINUTES$$
    CREATE
      FUNCTION TRUNC_N_MINUTES(datestamp DATETIME, n INT)
      RETURNS DATETIME DETERMINISTIC NO SQL
      COMMENT 'truncate to N minute boundary. For example,
               TRUNCATE_N_MINUTES(sometime, 15) gives the nearest
               preceding quarter hour'
      RETURN DATE_FORMAT(datestamp,'%Y-%m-%d %H:00') +
                    INTERVAL (MINUTE(datestamp) -
                              MINUTE(datestamp) MOD n) MINUTE$$
    DELIMITER ;
    

    Then your query will say

      SELECT TRUNC_N_MINUTES(datestamp, 6) datestamp, AVG(temp) temp
        FROM measure
       WHERE datestamp >= CURDATE() - INVERVAL 7 DAY
      GROUP BY TRUNC_N_MINUTES(datestamp, 6)
      ORDER BY TRUNC_N_MINUTES(datestamp, 6)
    

    If you want to summarize by 5, 10, 15, or minute boundaries (three items per hour) simply use that number in place of 6.

    You'll need different trunc() functions for hours, etc.

    The trunc() function for daily summaries is DATE(datestamp). For monthly summaries it is LAST_DAY(datestamp). For example,

      SELECT LAST_DAY(datestamp) month_ending, AVG(temp) temp
        FROM measure
      GROUP BY LAST_DAY(datestamp) 
      ORDER BY LAST_DAY(datestamp) 
    

    yields a month-by-month summary.