Search code examples
sqlalgorithmpostgresqlmergegreatest-n-per-group

Merging two data sets on closest date efficiently in PostgreSQL


I try to merge two tables with different time resolution on their closest date.

The tables are like this:

Table1:

id    | date    | device  | value1
----------------------------------
1     | 10:22   | 13      | 0.53
2     | 10:24   | 13      | 0.67
3     | 10:25   | 14      | 0.83
4     | 10:25   | 13      | 0.32

Table2:

id    | date    | device  | value2
----------------------------------
22    | 10:18   | 13      | 0.77
23    | 10:21   | 14      | 0.53
24    | 10:23   | 13      | 0.67
25    | 10:28   | 14      | 0.83
26    | 10:31   | 13      | 0.23

I want to merge these tables along the first one. So I want to append value2 to Table1, where, for each device, the latest value2 appears.

Result:

id    | date    | device  | value1 | value2
-------------------------------------------
1     | 10:22   | 13      | 0.53   | 0.77
2     | 10:24   | 13      | 0.67   | 0.67
3     | 10:25   | 14      | 0.83   | 0.53
4     | 10:25   | 13      | 0.32   | 0.67

I have some (20-30) devices, thousands of rows in Table2 (=m) and millions of them in Table1 (=n).

I could sort all the tables along date (O(n*logn)), write them into a text file and iterate over Table1 like a merge, while pulling data from Table2 until it is newer (I have to manage that ~20-30 pointers to the latest data for each device, but no more), and after the merge I could upload it back to the database. Then the complexities are O(n*log(n)) for sorting and O(n+m) for iterating over the tables.

But it would be much better to do it in the database at all. But the best query I could achive was O(n^2) complexity:

SELECT DISTINCT ON (Table1.id)
       Table1.id, Table1.date, Table1.device, Table1.value1, Table2.value2
FROM Table1, Table2
WHERE Table1.date > Table2.date and Table1.device = Table2.device
ORDER BY Table1.id, Table1.date-Table2.date;

It's really slow for the data amount I need to process, are there better ways to do this? Or just do that stuff with the downloaded data?


Solution

  • Because table 1 is so much smaller, it might be more efficient to use a correlated subquery:

    select t1.*,
           (select t2.value2
            from table2 t2
            where t2.device = t.device and t2.date <= t1.date
            order by t2.date desc
            limit 1
           ) as value2
    from table1 t1;
    

    Also create an index on table2(device, date, value2) for performance.