Search code examples
sqlpostgresqlcommon-table-expression

Join complementary tables in Postgesql


Let's say I have two tables table1and table2, with three columns each, id, time, value. They store the same kind of information, ie. a 30-minutes timeseries data for several ids (let's imagine a machine that produces an amount of energy per day). table2 contains more precise information than table1, but not for all timestamp nor all ids.

I want to get the best out of the two tables over the period defined by table2, ie. storing data from table2 when available, and discarding to table1 when required (to add some more complexity, let's say that table1 is not a real table, but rather a view that takes a hell lot of time to be fully computed, so that I want to avoid compute it in its integrality).

I thought I could define a perimeter of id-time to indicate which value should be kept each day (teh daily scale should be equivalent to the 30-minutes timestamp, and be less resource-consuming). Thus I went for :

with perimeter_per_day_table2  as ( 
  select distinct 
         id, 
         date_trunc('day', time) as day
    from table2
), 

perimeter_per_day_table1 as (
  select id, 
         date_trunc('day', time) as day, 
    from table1
    where day >= (select min(time) from table2)
      and day <= (select max(time) from table2)
      and (id, day) not in (select id, day from perimeter_per_day_table2)
)

select * from perimeter_per_day_table1

but that takes a hell lot of time. In particular, it seems like the condition where (id, day) not in (select id, day from perimeter_per_day_table2) is very hard for Potsgresql to handle.

Any suggestion ?


Solution

  • Indeed, NOT IN isn't optimized as well as NOT EXIST in Postgres. So an equivalent not exists () condition is typically faster.

    However, in neither case do you need to apply a (costly) DISTINCT on the rows in the sub-query.

    with perimeter_per_day_table1 as (
      select t1.id, 
             date_trunc('day', t1.time) as day
      from table1 t1
      where t1.day >= (select min(time) from table2)
        and t1.day <= (select max(time) from table2)
        and not exists (select * 
                        from table2 t2 
                        where t1.id = t2.id
                          and t1.day = t2.day)
    )
    select * 
    from perimeter_per_day_table1;
    

    You can even avoid querying table2 twice for the min/max, but I doubt that will make a huge difference if there is an index on the time column:

    with min_max as ( 
       select min(time) as min_time, 
              max(time) as max_time
       from table2
    ), perimeter_per_day_table1 as (
      select t1.id, 
             date_trunc('day', t1.time) as day
      from table1 t1
        cross join min_max 
      where t1.day >= min_max.min_time
        and t1.day <= min_max.max_time
        and not exists (select * 
                        from table2 t2 
                        where t1.id = t2.id
                          and t1.day = t2.day)
    )
    select * 
    from perimeter_per_day_table1;