Tags: sql, hive, cumulative-sum, cross-join

What is an efficient alternative to cross joining two large tables to get a running total?


I have two tables with the following schemas.

table1:

event_dt
6/30/2018
7/1/2018
7/2/2018
7/3/2018
7/4/2018
7/5/2018
7/6/2018
7/7/2018
7/8/2018
7/9/2018
7/10/2018

table2:

event_dt    time(in seconds)
7/7/2018     144 
7/8/2018     63 
7/1/2018     47 
7/8/2018     81 
7/9/2018     263 
7/7/2018     119 
7/8/2018     130 
7/9/2018     206 
7/5/2018     134 
7/1/2018     140 

For each date in table1, I want the cumulative sum of time up to and including that date, so I used a cross join to get the output with the following query:

select t1.event_dt, sum(t2.time)
from yp1 t1 cross join yp2 t2
where t1.event_dt>=t2.event_dt
group by t1.event_dt

With this query I get the cumulative running total for each date in table1, as long as there is at least one event on or before that date. For example, the first event date is 7/1/2018 but the first date in table1 is 6/30/2018, so 6/30/2018 does not appear in the final output.

The problem with this method is that the cross join takes too long: I have millions of records, since an observation is taken every 6 seconds. Is there a way to get the same results without a cross join, or any other approach that is more efficient?


Solution

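  • I think the best way is to use SQL's cumulative sum function, a windowed sum(time) over (order by event_dt), applied to the union of both tables. The rows that come from the first table carry a null time, so they add nothing to the total and serve only as markers to filter on afterwards: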

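    select event_dt, running_time
    from (select event_dt, time,
                 -- running total; the default frame covers all rows sharing an event_dt
                 sum(time) over (order by event_dt) as running_time
          from (-- marker rows: one per date in the first table (yp1 in the question)
                (select event_dt, null as time
                 from t1
                ) union all
                -- the observations themselves (yp2 in the question)
                (select event_dt, time
                 from t2
                )
               ) u
         ) tt
    -- keep only the marker rows; dates before the first event get a null total
    where time is null;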
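
    The query above can be checked against the sample data from the question. What follows is only a minimal sketch under some assumptions: it uses SQLite (through Python's sqlite3 module, which needs SQLite 3.25 or later for window functions) as a stand-in for Hive, stores the dates as ISO strings so that text comparison sorts chronologically, uses the question's table names yp1 and yp2, and drops the parentheses around the union branches because SQLite does not accept them.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table yp1 (event_dt text);
    create table yp2 (event_dt text, time integer);
""")

# table1: every date from 6/30/2018 to 7/10/2018, rewritten as ISO strings.
dates = ["2018-06-30"] + [f"2018-07-{d:02d}" for d in range(1, 11)]
conn.executemany("insert into yp1 values (?)", [(d,) for d in dates])

# table2: the ten sample observations from the question.
events = [
    ("2018-07-07", 144), ("2018-07-08", 63), ("2018-07-01", 47),
    ("2018-07-08", 81), ("2018-07-09", 263), ("2018-07-07", 119),
    ("2018-07-08", 130), ("2018-07-09", 206), ("2018-07-05", 134),
    ("2018-07-01", 140),
]
conn.executemany("insert into yp2 values (?, ?)", events)

# The original cross join from the question.
CROSS_JOIN = """
    select t1.event_dt, sum(t2.time)
    from yp1 t1 cross join yp2 t2
    where t1.event_dt >= t2.event_dt
    group by t1.event_dt
    order by t1.event_dt
"""

# The windowed alternative: union both tables, take a running sum,
# then keep only the marker rows that came from yp1 (time is null).
WINDOWED = """
    select event_dt, running_time
    from (select event_dt, time,
                 sum(time) over (order by event_dt) as running_time
          from (select event_dt, null as time from yp1
                union all
                select event_dt, time from yp2) u
         ) w
    where time is null
    order by event_dt
"""

cross_rows = conn.execute(CROSS_JOIN).fetchall()
window_rows = conn.execute(WINDOWED).fetchall()

for row in window_rows:
    print(row)
```

    On this data the two queries agree for every date from 7/1/2018 onwards, ending at a total of 1327 seconds. The one difference is 6/30/2018: the cross join omits it, while the windowed query returns it with a null running total. Adding "and running_time is not null" to the outer where clause should reproduce the cross join's output exactly.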