database, time-series, influxdb, timescaledb

Time Series DBMS for irregular data with gaps


Does a database engine exist that can fill gaps in irregular time series and also group on them?

The most capable system I have found is InfluxDB:

> select * from test;
name: test
time value1 value2
---- ------ ------
1    1      0
2    2      
3    3      
4    4      
5    5      2
6    6      
7    7      
8    8      

> select * from test fill(previous);
name: test
time value1 value2
---- ------ ------
1    1      0
2    2      0
3    3      0
4    4      0
5    5      2
6    6      2
7    7      2
8    8      2

Unfortunately it lacks grouping on keys that are not tags, and I need to group on arbitrary columns:

> select mean(value1) from (select value1, value2 from test fill(previous)) group by value2;
name: test
tags: value2=
time mean
---- ----
0    4.5
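
What I would like to express, in plain SQL terms, is roughly the following (just a sketch; test_filled is a hypothetical relation that already contains the fill(previous) result from above):

-- Hypothetical: test_filled is the gap-filled data from above,
-- i.e. every missing value2 already replaced by the previous observation.
SELECT value2,
       avg(value1) AS mean_value1
FROM test_filled
GROUP BY value2;

On the sample data above this should give a mean of 2.5 for value2 = 0 and 6.5 for value2 = 2, instead of one single group with an empty tag.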

I also tried TimescaleDB on top of PostgreSQL, but its gap-filling function does not fit my case. My real data uses milliseconds since epoch as the timestamp, and TimescaleDB fills every gap between my rows, even for intervals in which nothing was logged at all. A table with 17k rows became a table with 500k rows because of this resampling.

Edit

Since I was asked to give more information about my attempt with TimescaleDB: I followed the information on this page:

https://blog.timescale.com/blog/sql-functions-for-time-series-analysis/

locf does what I want.

The problem here is that locf cannot be used without an aggregate function like avg, and you have to pass very specific time intervals, as described here:

https://docs.timescale.com/latest/api#locf

I am analyzing valve states. A valve can either be open or closed. To keep upload times short, I only upload data when a valve actually switches its state. But I also want to know whether a valve is open while another valve is open. Therefore I need to fill all gaps in each column, in other words do "last observation carried forward" (LOCF).

One of my test data sets has 17,000 rows of valve switches. I set the bucket interval to 1 ms (because a valve could start to switch its state in any millisecond), so that locf with avg aggregates at most one observation per bucket and never produces values like 0.5 for a valve state, which can only be open or closed. But the number of rows then explodes from 17,000 to 500,000 because of the time_bucket_gapfill() function.

The query looked roughly like this:

SELECT time_bucket_gapfill(
           '1 millisecond',
           ts::TIMESTAMP,
           start  => '2020-03-11 11:09:49',
           finish => '2020-03-11 11:17:16'
       ) AS tsi,
       LOCF(avg(airpressure1))
FROM (SELECT TO_TIMESTAMP(TIME::DOUBLE PRECISION / 1000.0)::TIMESTAMP AS ts,
             airpressure1,
             airpressure2,
             airpressure3,
             airpressure4
      FROM ts_data
      WHERE process_id = 1
      ORDER BY TO_TIMESTAMP(TIME::DOUBLE PRECISION / 1000.0)::TIMESTAMP DESC) AS t
GROUP BY tsi;
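
For comparison, last observation carried forward can also be expressed without time_bucket_gapfill(), using the well-known count()-over-window trick in plain PostgreSQL. This is only a minimal sketch (it assumes the same ts_data table and fills only airpressure1), but it does not multiply the number of rows:

-- count(col) OVER (ORDER BY ...) only increments on non-NULL values,
-- so it labels the "island" of rows that starts with each observed value;
-- max() within that island carries the value forward onto the NULL rows.
SELECT ts,
       max(airpressure1) OVER (PARTITION BY grp1) AS airpressure1_filled
FROM (SELECT TO_TIMESTAMP(TIME::DOUBLE PRECISION / 1000.0)::TIMESTAMP AS ts,
             airpressure1,
             count(airpressure1) OVER (ORDER BY TIME) AS grp1
      FROM ts_data
      WHERE process_id = 1) AS t
ORDER BY ts;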

Edit 2

A real example would be that no air-pressure valve may be opened unless its safety valve has been opened before. With the gaps filled, this could easily be queried (see the sketch after the sample data below). But there are many more use cases. The table also contains analog data such as pressures, but let's leave that out for simplicity.

SELECT TIME,
       airpressure1 as air1,
       safetyvlv_out1 as safety1,
       airpressure2 as air2,
       safetyvlv_out2 as safety2,
       airpressure3 as air3,
       safetyvlv_out3 as safety3,
       airpressure4 as air4,
       safetyvlv_out4 as safety4
FROM ts_data_gaps
WHERE airpressure1 IS NOT NULL
OR    safetyvlv_out1 IS NOT NULL;

time    air1    safety1 air2    safety2 air3    safety3 air4    safety4
1583921837975       1.0     1.0     1.0     1.0
1583921844020       0.0     0.0     0.0     0.0
1583921868224       1.0                     
1583921878845   1.0                         
1583921878985   0.0                         
1583921879798   1.0                         
1583921879900   0.0                         
1583921880537       0.0                     
1583921915212       1.0     1.0     1.0     1.0
1583921919219   1.0     1.0     1.0     1.0 
1583921919488   0.0     0.0     0.0     0.0 
1583921923488   1.0     1.0     1.0     1.0 
1583921926491   0.0     0.0     0.0     0.0 
1583921930497   1.0     1.0     1.0     1.0 
1583921933501   0.0     0.0     0.0     0.0 
1583921934008       0.0     0.0     0.0     0.0
1583921389639       1.0     1.0     1.0     1.0
1583921395681       0.0     0.0     0.0     0.0
1583921415256       1.0                     
1583921425912   1.0                         
1583921426027   0.0                         
1583921426729   1.0                         
1583921426837   0.0                         
1583921427043   1.0                         
1583921427084   0.0                         
1583921427582       0.0     
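
With the gaps filled, the safety rule above becomes a simple filter. A sketch, assuming a hypothetical relation ts_data_filled with the column aliases from the query above but with every valve column already carried forward:

-- List every moment at which airpressure valve 1 is open
-- while its safety valve is not (yet) open.
SELECT TIME, air1, safety1
FROM ts_data_filled
WHERE air1 = 1.0
  AND coalesce(safety1, 0.0) = 0.0;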

When I do something like this on 17k rows:

create view z as
SELECT TIME,
       airpressure1 as air1,
       safetyvlv_out1 as safety1,
       airpressure2 as air2,
       safetyvlv_out2 as safety2,
       airpressure3 as air3,
       safetyvlv_out3 as safety3,
       airpressure4 as air4,
       safetyvlv_out4 as safety4
FROM ts_data_gaps
WHERE airpressure1 IS NOT NULL
OR    safetyvlv_out1 IS NOT NULL;

SELECT time_bucket_gapfill(
           '1 millisecond',
           ts::TIMESTAMP,
           start  => '2020-03-11 03:41:08',
           finish => '2020-03-11 11:18:54'
       ) AS tsi,
       LOCF(avg(air1)) AS air1,
       LOCF(avg(safety1)) AS safety1
FROM (SELECT TO_TIMESTAMP(TIME::DOUBLE PRECISION / 1000.0)::TIMESTAMP AS ts,
             air1, safety1
      FROM z
      ORDER BY TO_TIMESTAMP(TIME::DOUBLE PRECISION / 1000.0)::TIMESTAMP DESC) AS t
GROUP BY tsi;

My PC with 32 GB of RAM runs out of memory! Not surprising: the requested range spans roughly 7 hours 38 minutes, i.e. about 27 million 1 ms buckets to fill.


Solution

  • The final solution was to use the Azure Cloud and Azure Data Explorer. It is the most powerful tool I have seen so far (besides writing my own analytics in Python). It fulfills all my needs and is really good at analyzing documents, tables and time series. It uses Kusto as its query language, which is also much easier and more intuitive than, for example, the MongoDB query language. Obviously it is not the best choice if you are looking for an open-source, free-of-charge solution.