Tags: time-series, grafana, influxdb

InfluxDB Continuous Query running on entire time series data


If my interpretation is correct, according to the documentation provided here: InfluxDB Downsampling, when we down-sample data using a Continuous Query that runs every 30 minutes, it processes only the previous 30 minutes of data.

Relevant part of the document:

Use the CREATE CONTINUOUS QUERY statement to generate a CQ:

    CREATE CONTINUOUS QUERY "cq_30m" ON "food_data" BEGIN
      SELECT mean("website") AS "mean_website", mean("phone") AS "mean_phone"
      INTO "a_year"."downsampled_orders"
      FROM "orders"
      GROUP BY time(30m)
    END

That query creates a CQ called cq_30m in the database food_data. cq_30m tells InfluxDB to calculate the 30-minute average of the two fields website and phone in the measurement orders and in the DEFAULT RP two_hours. It also tells InfluxDB to write those results to the measurement downsampled_orders in the retention policy a_year with the field keys mean_website and mean_phone. InfluxDB will run this query every 30 minutes for the previous 30 minutes.

When I create a Continuous Query, it actually runs on the entire dataset, not just on the previous 30 minutes. My question is: does this happen only the first time, after which it runs on the previous 30 minutes of data instead of the entire dataset?

I understand that the query itself uses GROUP BY time(30m), which means that on its own it returns all the data grouped into 30-minute buckets, but does this also hold true for the Continuous Query? If so, should I then include a filter in the Continuous Query so that it only processes the last 30 minutes of data?


Solution

  • What you have described is expected functionality.

    Schedule and coverage

    Continuous queries operate on real-time data. They use the local server’s timestamp, the GROUP BY time() interval, and InfluxDB database’s preset time boundaries to determine when to execute and what time range to cover in the query.

    CQs execute at the same interval as the cq_query’s GROUP BY time() interval, and they run at the start of the InfluxDB database’s preset time boundaries. If the GROUP BY time() interval is one hour, the CQ executes at the start of every hour.

    When the CQ executes, it runs a single query for the time range between now() and now() minus the GROUP BY time() interval. If the GROUP BY time() interval is one hour and the current time is 17:00, the query’s time range is between 16:00 and 16:59.999999999.

    So it should only process the last 30 minutes.

    It's a good point about the first run, though.

    I did manage to find a snippet from an old (v0.8) document:

    Backfilling Data

    In the event that the source time series already has data in it when you create a new downsampled continuous query, InfluxDB will go back in time and calculate the values for all intervals up to the present. The continuous query will then continue running in the background for all current and future intervals.

    https://influxdbcom.readthedocs.io/en/latest/content/docs/v0.8/api/continuous_queries/#backfilling-data

    That would explain the behaviour you have found.
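
To make the "Schedule and coverage" rule quoted above concrete, here is a minimal Python sketch of how the time range for one CQ execution could be worked out. It is only an illustration, not InfluxDB's actual implementation: the function name cq_window is hypothetical, and it assumes that the "preset time boundaries" are interval multiples counted from the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def cq_window(now: datetime, interval: timedelta) -> tuple[datetime, datetime]:
    """Return (start, end) of the range one CQ run would cover.

    Hypothetical helper. Assumption: boundaries are multiples of
    `interval` counted from the Unix epoch. `end` is exclusive.
    """
    # Number of whole intervals elapsed since the epoch (integer).
    whole_intervals = (now - EPOCH) // interval
    # Round "now" down to the most recent boundary.
    end = EPOCH + whole_intervals * interval
    # The run covers exactly one interval before that boundary.
    start = end - interval
    return start, end


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 17, 0, 3, tzinfo=timezone.utc)
    start, end = cq_window(now, timedelta(minutes=30))
    print(start.isoformat(), "to", end.isoformat(), "(end exclusive)")
```

With a one-hour interval and a current time of 17:00, this yields 16:00 up to (but not including) 17:00, which matches the 16:00 to 16:59.999999999 range in the quoted documentation. Under that rule each scheduled run only touches one interval, so no extra time filter is needed for the regular runs.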