I have a sparse sensors table, with 120 sensor columns, in which I am getting just a few sensor values at any given timestamp, so most of my values are null, as in the image below.
When I want to query data from any given sensor, of from all, I am first SAMPLING the data with an avg
or a last_not_null
function aggregation, and then I often build a CTE and call LATEST ON
to get results.
SELECT
timestamp,
vehicle_id,
avg(sensor_1) AS avg_sensor_1, avg(sensor_2) AS avg_sensor_2,
...
avg(sensor_119) AS avg_sensor_119, avg(sensor_120) AS avg_sensor_120
FROM
vehicle_sensor_data
-- WHERE vehicle_id = 'AAA0000'
SAMPLE BY 30s
LIMIT 100000;
This works, but it is not super fast (1sec for 10 million rows, in a table with 120 sensor columns and with 10k different vehicle_ids), and I am also aware this is not very efficient because null
columns take some bytes on disk, and in my case this is probably wasteful.
I was thinking of partitioning by hour or day, and at the end of each partition period materialising the SAMPLE BY
query into a dense table, and then dropping the partition from the spare table. That should probably work fine, but I was wondering if anyone had a better suggestion?
A single table works, but there is a more efficient (although a bit more cumbersome if you compose queries by hand) way to do this.
You can create 120 tables, one per sensor, rather than a table with 120 columns. Well, technically you probably want 121 tables, one with the common dimensions, then 1 per sensor. Or maybe you want N tables, one for the common dimensions, then N depending on how many sensor groups you have, as some groups might always send in sync. In any case, rather than a wide table you would end up with several narrow tables that you would need to join, as in the picture below.
Now for joining the tables there are three potential ways, depending on the results you are after:
CROSS JOIN
strategy (example below). This would return a single row.LEFT JOIN
strategy. This would return a single row per series (example below)ASOF JOIN
strategy. This would return as many rows as we have on the main metric we are querying. (example below)I've tested the three approaches and they perform well. The three queries were executed on a table like the initial one, with 10 million rows representing sparse data from 10k series and across 120 metrics, so 120 tables. Each of the 120 tables had ~83k records (which times 120 is ~10 million rows).
CROSS JOIN
is the fastest, executing in 23ms, ASOF JOIN
is second with 123 ms, and LEFT JOIN
is the slowest at 880ms. Still not too bad, as you probably will not want to get all the sensors from all the devices all the time, and joining fewer tables would perform better.
See the example queries below
CROSS JOIN STRATEGY
We first find the latest point in each of the 120 tables for the given series (AAA0000), so we get a value per table, and then do a CROSS JOIN
, to get a single row.
WITH
s1 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_1
WHERE vehicle_id = 'AAA0000' LATEST ON timestamp PARTITION BY vehicle_id),
s2 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_2
WHERE vehicle_id = 'AAA0000' LATEST ON timestamp PARTITION BY vehicle_id),
...
s119 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_119
WHERE vehicle_id = 'AAA0000' LATEST ON timestamp PARTITION BY vehicle_id),
s120 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_120
WHERE vehicle_id = 'AAA0000' LATEST ON timestamp PARTITION BY vehicle_id)
SELECT s1.timestamp, s1.vehicle_id, s1.value AS value_1,
s2.value AS value_2,
...
s119.value AS value_119,
s120.value AS value_120
FROM s1
CROSS JOIN s2
CROSS JOIN ...
CROSS JOIN s119
CROSS JOIN s120;
LEFT JOIN STRATEGY
We first find the latest point in each of the 120 tables for each series, so we get a value per table and series, and then do a LEFT JOIN
on the series ID, to get a single row for each different series (10K rows in our example)
WITH
s1 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_1
LATEST ON timestamp PARTITION BY vehicle_id),
s2 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_2
LATEST ON timestamp PARTITION BY vehicle_id),
...
s119 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_119
LATEST ON timestamp PARTITION BY vehicle_id),
s120 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_120
LATEST ON timestamp PARTITION BY vehicle_id)
SELECT s1.timestamp, s1.vehicle_id, s1.value AS value_1,
s2.value AS value_2,
...
s119.value AS value_119,
s120.value AS value_120
FROM s1
LEFT JOIN s2 ON s1.vehicle_id = s2.vehicle_id
LEFT JOIN ...
LEFT JOIN s119 ON s1.vehicle_id = s119.vehicle_id
LEFT JOIN s120 ON s1.vehicle_id = s120.vehicle_id;
ASOF JOIN STRATEGY
We get all the rows in all the tables, then do an ASOF JOIN
on the series ID, so we get a row for each row of the first table in the query, in our example ~83K results
WITH
s1 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_1 ),
s2 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_2 ),
...
s118 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_118 ),
s119 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_119 ),
s120 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_120 )
SELECT s1.timestamp, s1.vehicle_id, s1.value AS value_1,
s2.value AS value_2,
...
s119.value AS value_119,
s120.value AS value_120
FROM s1
ASOF JOIN s2 ON s1.vehicle_id = s2.vehicle_id
ASOF JOIN ...
ASOF JOIN s119 ON s1.vehicle_id = s119.vehicle_id
ASOF JOIN s120 ON s1.vehicle_id = s120.vehicle_id;