I try to merge two tables with different time resolution on their closest date.
The tables are like this:
Table1:
id | date | device | value1
----------------------------------
1 | 10:22 | 13 | 0.53
2 | 10:24 | 13 | 0.67
3 | 10:25 | 14 | 0.83
4 | 10:25 | 13 | 0.32
Table2:
id | date | device | value2
----------------------------------
22 | 10:18 | 13 | 0.77
23 | 10:21 | 14 | 0.53
24 | 10:23 | 13 | 0.67
25 | 10:28 | 14 | 0.83
26 | 10:31 | 13 | 0.23
I want to merge these tables along the first one. So I want to append value2 to Table1, where, for each device, the latest value2 appears.
Result:
id | date | device | value1 | value2
-------------------------------------------
1 | 10:22 | 13 | 0.53 | 0.77
2 | 10:24 | 13 | 0.67 | 0.67
3 | 10:25 | 14 | 0.83 | 0.53
4 | 10:25 | 13 | 0.32 | 0.67
I have some (20-30) devices, thousands of rows in Table2 (=m) and millions of them in Table1 (=n).
I could sort all the tables along date (O(n*logn)
), write them into a text file and iterate over Table1 like a merge, while pulling data from Table2 until it is newer (I have to manage that ~20-30 pointers to the latest data for each device, but no more), and after the merge I could upload it back to the database. Then the complexities are O(n*log(n))
for sorting and O(n+m)
for iterating over the tables.
But it would be much better to do it in the database at all. But the best query I could achive was O(n^2) complexity:
SELECT DISTINCT ON (Table1.id)
Table1.id, Table1.date, Table1.device, Table1.value1, Table2.value2
FROM Table1, Table2
WHERE Table1.date > Table2.date and Table1.device = Table2.device
ORDER BY Table1.id, Table1.date-Table2.date;
It's really slow for the data amount I need to process, are there better ways to do this? Or just do that stuff with the downloaded data?
Because table 1 is so much smaller, it might be more efficient to use a correlated subquery:
select t1.*,
(select t2.value2
from table2 t2
where t2.device = t.device and t2.date <= t1.date
order by t2.date desc
limit 1
) as value2
from table1 t1;
Also create an index on table2(device, date, value2)
for performance.