Efficient way to query sparse data using QuestDB

I have a sparse sensors table, with 120 sensor columns, in which I am getting just a few sensor values at any given timestamp, so most of my values are null, as in the image below.

When I want to query data from any given sensor, of from all, I am first SAMPLING the data with an avg or a last_not_null function aggregation, and then I often build a CTE and call LATEST ON to get results.

SELECT
        timestamp,
        vehicle_id,
        avg(sensor_1) AS avg_sensor_1, avg(sensor_2) AS avg_sensor_2,
        ...
        avg(sensor_119) AS avg_sensor_119, avg(sensor_120) AS avg_sensor_120
    FROM
        vehicle_sensor_data
    -- WHERE vehicle_id = 'AAA0000'
    SAMPLE BY 30s
    LIMIT 100000;

This works, but it is not super fast (1sec for 10 million rows, in a table with 120 sensor columns and with 10k different vehicle_ids), and I am also aware this is not very efficient because null columns take some bytes on disk, and in my case this is probably wasteful.

I was thinking of partitioning by hour or day, and at the end of each partition period materialising the SAMPLE BY query into a dense table, and then dropping the partition from the spare table. That should probably work fine, but I was wondering if anyone had a better suggestion?

Solution

A single table works, but there is a more efficient (although a bit more cumbersome if you compose queries by hand) way to do this.

You can create 120 tables, one per sensor, rather than a table with 120 columns. Well, technically you probably want 121 tables, one with the common dimensions, then 1 per sensor. Or maybe you want N tables, one for the common dimensions, then N depending on how many sensor groups you have, as some groups might always send in sync. In any case, rather than a wide table you would end up with several narrow tables that you would need to join, as in the picture below.

Now for joining the tables there are three potential ways, depending on the results you are after:

To see the LATEST known value for all the metrics for a given series, we would use a CROSS JOIN strategy (example below). This would return a single row.
To see the LATEST known value for all the metrics and for all or several series, we would use a LEFT JOIN strategy. This would return a single row per series (example below)
To see the rolling view of all the latest known values regarding the current row for one of the metrics, we would use an ASOF JOIN strategy. This would return as many rows as we have on the main metric we are querying. (example below)

I've tested the three approaches and they perform well. The three queries were executed on a table like the initial one, with 10 million rows representing sparse data from 10k series and across 120 metrics, so 120 tables. Each of the 120 tables had ~83k records (which times 120 is ~10 million rows).

CROSS JOIN is the fastest, executing in 23ms, ASOF JOIN is second with 123 ms, and LEFT JOIN is the slowest at 880ms. Still not too bad, as you probably will not want to get all the sensors from all the devices all the time, and joining fewer tables would perform better.

See the example queries below

CROSS JOIN STRATEGY We first find the latest point in each of the 120 tables for the given series (AAA0000), so we get a value per table, and then do a CROSS JOIN, to get a single row.

WITH
s1 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_1 
    WHERE vehicle_id = 'AAA0000' LATEST ON timestamp PARTITION BY vehicle_id),
s2 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_2 
    WHERE vehicle_id = 'AAA0000' LATEST ON timestamp PARTITION BY vehicle_id),
...
s119 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_119 
    WHERE vehicle_id = 'AAA0000' LATEST ON timestamp PARTITION BY vehicle_id),
s120 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_120 
    WHERE vehicle_id = 'AAA0000' LATEST ON timestamp PARTITION BY vehicle_id)
SELECT s1.timestamp, s1.vehicle_id, s1.value AS value_1, 
s2.value AS value_2, 
...
s119.value AS value_119, 
s120.value AS value_120
FROM s1
CROSS JOIN s2 
CROSS JOIN ...
CROSS JOIN s119 
CROSS JOIN s120;

LEFT JOIN STRATEGY We first find the latest point in each of the 120 tables for each series, so we get a value per table and series, and then do a LEFT JOIN on the series ID, to get a single row for each different series (10K rows in our example)

WITH
s1 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_1 
    LATEST ON timestamp PARTITION BY vehicle_id),
s2 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_2 
    LATEST ON timestamp PARTITION BY vehicle_id),
...
s119 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_119 
    LATEST ON timestamp PARTITION BY vehicle_id),
s120 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_120 
    LATEST ON timestamp PARTITION BY vehicle_id)
SELECT s1.timestamp, s1.vehicle_id, s1.value AS value_1, 
s2.value AS value_2, 
...
s119.value AS value_119, 
s120.value AS value_120
FROM s1
LEFT JOIN s2 ON s1.vehicle_id = s2.vehicle_id 
LEFT JOIN ...
LEFT JOIN s119 ON s1.vehicle_id = s119.vehicle_id 
LEFT JOIN s120 ON s1.vehicle_id = s120.vehicle_id;

ASOF JOIN STRATEGY

We get all the rows in all the tables, then do an ASOF JOIN on the series ID, so we get a row for each row of the first table in the query, in our example ~83K results

WITH
s1 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_1 ),
s2 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_2 ),
...
s118 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_118 ),
s119 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_119 ),
s120 AS (SELECT timestamp, vehicle_id, value FROM vehicle_sensor_120 )
SELECT s1.timestamp, s1.vehicle_id, s1.value AS value_1, 
       s2.value AS value_2, 
       ... 
       s119.value AS value_119, 
       s120.value AS value_120
FROM s1
ASOF JOIN s2 ON s1.vehicle_id = s2.vehicle_id 
ASOF JOIN ... 
ASOF JOIN s119 ON s1.vehicle_id = s119.vehicle_id 
ASOF JOIN s120 ON s1.vehicle_id = s120.vehicle_id;