I previously touched on this problem in my post "What's the best way to return a sample of data over a period?", but while the potential solutions offered there were promising, they ultimately didn't solve my problem (which has since changed slightly too).
The problem I'm trying to solve is quickly returning aggregated data, bucketed into time-bound chunks, from a large dataset. For example: from a collection of temperature readings, return the average hourly reading for the last 24 hours.
For this example, let's say we have a collection named observations that contains temperature data collected from many devices at a rate of one reading per second. This collection contains a large amount of data (the dataset I'm working with has 120M documents). Each document contains three fields: deviceId, timestamp, and temperature.
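For illustration, a document would look something like this (the concrete values, and the assumption that timestamp is numeric Unix time, are mine):

{
    "deviceId": "device-001",
    "timestamp": 1672531200,
    "temperature": 21.4
}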
For clarity, this gives us:

- 200 devices
- 3,600 documents per device per hour
- 86,400 documents per device per day
- 17,280,000 documents per day (all devices)
- 120,960,000 documents per week (all devices)

Retrieving data for a device for a period of time is trivial:
FOR o IN observations
    FILTER o.deviceId == @deviceId
    FILTER o.timestamp >= @start AND o.timestamp <= @end
    RETURN o
The issue comes when trying to return aggregated data. Let's say that we want to return three sets of data for a specific deviceId:
(In case it affects the potential solutions: some of the data may not be per second but at a lower rate, for example per 15 seconds, perhaps even per minute. There may also be missing data for certain periods of time. While I've assumed ideal conditions for the figures above, we all know reality is rarely that simple.)
I've looked into using WINDOW (example below), but the query ran very slowly; whether this was an issue with the query or with the volume of data, I don't know, and I couldn't find much information on it. From what I can tell, WINDOW computes a rolling aggregate for every input row rather than one per chunk, so it would also still require a way to perform multiple readings, one per period of time (what I think of as a step).
FOR o IN observations
    FILTER o.deviceId == @deviceId
    WINDOW DATE_TIMESTAMP(o.timestamp) WITH { preceding: "PT60M" }
    AGGREGATE temperature = AVG(o.temperature)
    RETURN {
        timestamp: o.timestamp,
        temperature
    }
I also looked at ways of filtering the timestamps based on a modulus, as suggested in the previous thread, but that didn't account for averages or for data that may be missing (a missed update, meaning no record with the exact timestamp, for example).
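For reference, the modulus approach looked roughly like this (my sketch, assuming Unix-second timestamps and an hourly step):

FOR o IN observations
    FILTER o.deviceId == @deviceId
    FILTER o.timestamp >= @start AND o.timestamp <= @end
    // keep only readings that fall exactly on the hour; a missed
    // update means no record at that timestamp and a missing point
    FILTER o.timestamp % 3600 == 0
    RETURN o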
I've pulled out the data and filtered it outside of ArangoDB, but this isn't really a solution: it's slow, and for large volumes of data (17M documents for a week of readings) it just wasn't working at all.
So I looked at recreating some of that logic in the query, stepping through each chunk of data and returning the average values. This works, but isn't the most performant, taking roughly 10 s for me (albeit on a relatively low-powered box, but still: slow):
LET steps = 24
LET stepsRange = 0..23
LET diff = @end - @start
LET interval = diff / steps

LET filteredObservations = (
    FOR o IN observations
        FILTER o.deviceId == @deviceId
        FILTER o.timestamp >= @start AND o.timestamp <= @end
        RETURN o
)

FOR step IN stepsRange
    RETURN (
        LET stepStart = @start + (interval * step)
        LET stepEnd = stepStart + interval
        FOR f IN filteredObservations
            FILTER f.timestamp >= stepStart AND f.timestamp <= stepEnd
            COLLECT AGGREGATE temperature = AVG(f.temperature)
            RETURN { step, temperature }
    )
I've also tried variations of the above using WINDOW, but without much luck. I'm not massively across the graph functionality of ArangoDB, coming from a relational/document database background, so I wonder if there is something there which could make querying this data quicker and easier.

I expect this query to be run simultaneously for several different time ranges and devices by many users, so the performance really needs to be < 1 s. In terms of compromises, if this could be achieved by picking one record from each time chunk (as sketched below), that would be okay.
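That compromise might look something like this (an untested sketch; it takes the first reading at or after each step boundary):

LET interval = (@end - @start) / 24
FOR step IN 0..23
    LET stepStart = @start + (interval * step)
    LET stepEnd = stepStart + interval
    RETURN FIRST(
        // first reading in the chunk; null if the chunk has no data
        FOR o IN observations
            FILTER o.deviceId == @deviceId
            FILTER o.timestamp >= stepStart AND o.timestamp <= stepEnd
            SORT o.timestamp ASC
            LIMIT 1
            RETURN o
    )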
I figured out the answer to this at about 2:00 am last night, waking up and scribbling down a general idea. I've just tested it, and it seems to run quite quickly.

My thought was this: grabbing an average between two timestamps is a quick query, so if we simplify the overall query to run a filtered aggregate per timestep, the cost of the whole query should just scale linearly with the number of datapoints required.

It wasn't much different from the example above:
// Slow method
LET steps = 24
LET stepsRange = 0..23
LET diff = @end - @start
LET interval = diff / steps

LET filteredObservations = (
    FOR o IN observations
        FILTER o.deviceId == @deviceId
        FILTER o.timestamp >= @start AND o.timestamp <= @end
        RETURN o
)

FOR step IN stepsRange
    RETURN (
        LET stepStart = @start + (interval * step)
        LET stepEnd = stepStart + interval
        FOR f IN filteredObservations
            FILTER f.timestamp >= stepStart AND f.timestamp <= stepEnd
            COLLECT AGGREGATE temperature = AVG(f.temperature)
            RETURN { step, temperature }
    )
Instead of filtering the observations first, and then filtering this subset again and again to collect the aggregates, we instead just loop through from 0 to n (in this case 23) and run the filtered aggregate directly against the collection for each step, letting the index do the heavy lifting:
// Quick method
LET steps = 24
LET stepsRange = 0..23
LET diff = @end - @start
LET interval = diff / steps

FOR step IN stepsRange
    LET stepStart = @start + (interval * step)
    LET stepEnd = stepStart + interval
    RETURN FIRST(
        FOR o IN observations
            FILTER o.deviceId == @deviceId
            FILTER o.timestamp >= stepStart AND o.timestamp <= stepEnd
            COLLECT AGGREGATE temperature = AVG(o.temperature)
            RETURN temperature
    )
In my case, the total query time is around 75 ms for 24 datapoints (hourly averages from 24 hours' worth of data). Increasing this to 48 points only increases the query run time by about 25%, and returning 1,440 steps (per-minute averages) runs in 132 ms.
I'd say this qualifies as a performant query.
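As an aside, the same result might be achievable in a single pass by bucketing each reading with FLOOR and collecting per bucket. This is an untested sketch; note that, unlike the step loop, it simply omits chunks that contain no data:

LET interval = (@end - @start) / 24
FOR o IN observations
    FILTER o.deviceId == @deviceId
    FILTER o.timestamp >= @start AND o.timestamp <= @end
    // derive a 0..23 bucket index from the timestamp offset
    // (a reading exactly at @end would land in a 25th bucket)
    COLLECT step = FLOOR((o.timestamp - @start) / interval)
    AGGREGATE temperature = AVG(o.temperature)
    RETURN { step, temperature }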
(It's worth noting that I have a persistent index for this collection, without which the query is very slow.)
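For anyone reproducing this, a combined persistent index over the fields used in the filters would be the natural shape; something like this from arangosh (the exact field choice here is an assumption on my part):

// arangosh: persistent index covering the filter fields
db.observations.ensureIndex({
    type: "persistent",
    fields: ["deviceId", "timestamp"]
});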