I previously touched on this problem in my post "What's the best way to return a sample of data over a period?", but while the potential solutions offered there were promising, they ultimately didn't solve my problem (which has since changed slightly too).
The problem I'm trying to solve is quickly returning aggregated data, bucketed into time-bound chunks, from a large dataset. For example: from a collection of temperature readings, return the average hourly reading for the last 24 hours.
For this example, let's say we have a collection named observations that contains temperature data collected from many devices at a rate of one reading per second. This collection contains a large amount of data (the dataset I'm working with has 120M documents). Each document contains three fields: deviceId, timestamp, and temperature.
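For illustration, a document would look something like this (the concrete values, and the assumption that timestamp is numeric Unix time, are mine):

{
    "deviceId": "device-001",
    "timestamp": 1672531200,
    "temperature": 21.4
}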
For clarity, this gives us:

- 200 devices
- 3,600 documents per device per hour
- 86,400 documents per device per day
- 17,280,000 documents per day (all devices)
- 120,960,000 documents per week (all devices)

Retrieving data for a device for a period of time is trivial:
FOR o IN observations
    FILTER o.deviceId == @deviceId
    FILTER o.timestamp >= @start AND o.timestamp <= @end
    RETURN o
The issue comes when trying to return aggregated data. Let's say that we want to return three sets of data for a specific deviceId:
(In case it affects the potential solutions: some of the data may not be per second but at a lower rate, for example per 15 seconds, perhaps even per minute. There may also be missing data for certain periods of time. While I've assumed ideal conditions for the figures above, we all know reality is rarely that simple.)
I've looked into using WINDOW (example below), but the query ran very slowly; whether this was an issue with the query or with the volume of data, I don't know, and I couldn't find much information on it. From what I can tell, WINDOW computes a rolling aggregate for every input row rather than one per chunk, so it would also still require a way to perform multiple readings, one per period of time (what I think of as a step).
FOR o IN observations
    FILTER o.deviceId == @deviceId
    WINDOW DATE_TIMESTAMP(o.timestamp) WITH { preceding: "PT60M" }
    AGGREGATE temperature = AVG(o.temperature)
    RETURN {
        timestamp: o.timestamp,
        temperature
    }
I also looked at ways of filtering the timestamps based on a modulus, as suggested in the previous thread, but that didn't account for averages or for data that may be missing (a missed update, meaning no record with the exact timestamp, for example).
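For reference, the modulus approach looked roughly like this (my sketch, assuming Unix-second timestamps and an hourly step):

FOR o IN observations
    FILTER o.deviceId == @deviceId
    FILTER o.timestamp >= @start AND o.timestamp <= @end
    // keep only readings that fall exactly on the hour; a missed
    // update means no record at that timestamp and a missing point
    FILTER o.timestamp % 3600 == 0
    RETURN o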
I've pulled out the data and filtered it outside of ArangoDB, but this isn't really a solution: it's slow, and for large volumes of data (17M documents for a week of readings) it just wasn't working at all.
So I looked at recreating some of that logic in the query, stepping through each chunk of data and returning the average values. This works, but isn't the most performant, taking roughly 10 s for me (albeit on a relatively low-powered box, but still: slow):
LET steps = 24
LET stepsRange = 0..23
LET diff = @end - @start
LET interval = diff / steps

LET filteredObservations = (
    FOR o IN observations
        FILTER o.deviceId == @deviceId
        FILTER o.timestamp >= @start AND o.timestamp <= @end
        RETURN o
)

FOR step IN stepsRange
    RETURN (
        LET stepStart = @start + (interval * step)
        LET stepEnd = stepStart + interval
        FOR f IN filteredObservations
            FILTER f.timestamp >= stepStart AND f.timestamp <= stepEnd
            COLLECT AGGREGATE temperature = AVG(f.temperature)
            RETURN { step, temperature }
    )
I've also tried variations of the above using WINDOW, but without much luck. I'm not massively across the graph functionality of ArangoDB, coming from a relational/document database background, so I wonder if there is something there which could make querying this data quicker and easier.

I expect this query to be run simultaneously for several different time ranges and devices by many users, so the performance really needs to be < 1 s. In terms of compromises, if this could be achieved by picking one record from each time chunk (as sketched below), that would be okay.
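That compromise might look something like this (an untested sketch; it takes the first reading at or after each step boundary):

LET interval = (@end - @start) / 24
FOR step IN 0..23
    LET stepStart = @start + (interval * step)
    LET stepEnd = stepStart + interval
    RETURN FIRST(
        // first reading in the chunk; null if the chunk has no data
        FOR o IN observations
            FILTER o.deviceId == @deviceId
            FILTER o.timestamp >= stepStart AND o.timestamp <= stepEnd
            SORT o.timestamp ASC
            LIMIT 1
            RETURN o
    )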
I figured out the answer to this at about 2:00 am last night, waking up and scribbling down a general idea. I've just tested it, and it seems to run quite quickly.

My thought was this: grabbing an average between two timestamps is a quick query, so if we simplify the overall query to run a filtered aggregate per timestep, the cost of the whole query should just scale linearly with the number of datapoints required.

It wasn't much different from the example above:
// Slow method
LET steps = 24
LET stepsRange = 0..23
LET diff = @end - @start
LET interval = diff / steps

LET filteredObservations = (
    FOR o IN observations
        FILTER o.deviceId == @deviceId
        FILTER o.timestamp >= @start AND o.timestamp <= @end
        RETURN o
)

FOR step IN stepsRange
    RETURN (
        LET stepStart = @start + (interval * step)
        LET stepEnd = stepStart + interval
        FOR f IN filteredObservations
            FILTER f.timestamp >= stepStart AND f.timestamp <= stepEnd
            COLLECT AGGREGATE temperature = AVG(f.temperature)
            RETURN { step, temperature }
    )
Instead of filtering the observations first, and then filtering this subset again and again to collect the aggregates, we instead just loop through from 0 to n (in this case 23) and run the filtered aggregate directly against the collection for each step, letting the index do the heavy lifting:
// Quick method
LET steps = 24
LET stepsRange = 0..23
LET diff = @end - @start
LET interval = diff / steps

FOR step IN stepsRange
    LET stepStart = @start + (interval * step)
    LET stepEnd = stepStart + interval
    RETURN FIRST(
        FOR o IN observations
            FILTER o.deviceId == @deviceId
            FILTER o.timestamp >= stepStart AND o.timestamp <= stepEnd
            COLLECT AGGREGATE temperature = AVG(o.temperature)
            RETURN temperature
    )
In my case, the total query time is around 75 ms for 24 datapoints (hourly averages from 24 hours' worth of data). Increasing this to 48 points only increases the query run time by about 25%, and returning 1,440 steps (per-minute averages) runs in 132 ms.
I'd say this qualifies as a performant query.
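As an aside, the same result might be achievable in a single pass by bucketing each reading with FLOOR and collecting per bucket. This is an untested sketch; note that, unlike the step loop, it simply omits chunks that contain no data:

LET interval = (@end - @start) / 24
FOR o IN observations
    FILTER o.deviceId == @deviceId
    FILTER o.timestamp >= @start AND o.timestamp <= @end
    // derive a 0..23 bucket index from the timestamp offset
    // (a reading exactly at @end would land in a 25th bucket)
    COLLECT step = FLOOR((o.timestamp - @start) / interval)
    AGGREGATE temperature = AVG(o.temperature)
    RETURN { step, temperature }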
(It's worth noting that I have a persistent index for this collection, without which the query is very slow.)
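For anyone reproducing this, a combined persistent index over the fields used in the filters would be the natural shape; something like this from arangosh (the exact field choice here is an assumption on my part):

// arangosh: persistent index covering the filter fields
db.observations.ensureIndex({
    type: "persistent",
    fields: ["deviceId", "timestamp"]
});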