Search code examples
sqlhiveunion-all

Is there a way to get rid of union all and combine several blocks into one? (Hive SQL)


I have a table with the following fields: start_date, end_date and traffic. I need to group data by every year and month from 2020-01-01 and sum traffic field. The tricky moment is that the year and month are not tied to the start_date and end_date. For example, in the resulting table for the year=2023 and month=2, the traffic sum calculated like this:

select
   2023 as year,
   2 as month,
   sum(traffic) as traffic
from table
where start_date <= '2023-02-01' and end_date >= '2023-02-01'

start_date and end_date values are between '1998-01-01' and now

With union all the required script looks like this:

select
   2020 as year,
   1 as month,
   sum(traffic) as traffic
from table
where start_date <= '2020-01-01' and end_date >= '2020-01-01'

union all

select
   2020 as year,
   2 as month,
   sum(traffic) as traffic
from table
where start_date <= '2020-02-01' and end_date >= '2020-02-01'

union all

select
   2020 as year,
   3 as month,
   sum(traffic) as traffic
from table
where start_date <= '2020-03-01' and end_date >= '2020-03-01'

union all

...
...
...

select
   2024 as year,
   3 as month,
   sum(traffic) as traffic
from table
where start_date <= '2024-03-01' and end_date >= '2024-03-01'

How to get rid of union all and make the script more compact? I tried to create a separate table selecting all dates from 1998, then calculating year, month, and first date of the month for every date. Example:

date month year first_date
1998-01-01 1 1998 1998-01-01
1998-01-02 1 1998 1998-01-01
1998-01-03 1 1998 1998-01-01
... ... ... ...

Then I joined this table with the source table and got this script:

select
   d.year,
   d.month,
   sum(traffic) as traffic
from table as t
join generated_dated as d on d.date = t.start_date
where start_date <= d.first_date and end_date >= d.first_date
group by
   d.year,
   d.month

But the results were wrong.

Data sample:

row start_date end_date traffic
1 2018-09-15 2019-02-13 11
2 2019-02-03 2021-03-05 12
3 2019-09-03 2022-03-05 2
4 2020-01-15 2023-04-02 4
5 2020-01-30 2022-10-19 4
6 2020-02-01 2020-02-02 7
7 2020-02-13 2020-02-19 9
8 2020-02-14 2020-07-05 15
9 2020-03-09 2022-08-08 24
10 2020-05-12 2022-09-01 24
11 2020-06-19 2024-02-05 11

For this data sample results should be:

year month traffic sum *condition *calculation for traffic sum
2020 1 14 "start_date <= '2020-01-01' and end_date >= '2020-01-01'" 'row2+row3'
2020 2 29 "start_date <= '2020-02-01' and end_date >= '2020-02-01'" 'row2+row3+row4+row5+row6'
2020 3 37 "start_date <= '2020-03-01' and end_date >= '2020-03-01'" 'row2+row3+row4+row5+row8'
2020 4 61 "start_date <= '2020-04-01' and end_date >= '2020-04-01'" 'row2+row3+row4+row5+row8+row9'
2020 5 61 "start_date <= '2020-05-01' and end_date >= '2020-05-01'" 'row2+row3+row4+row5+row8+row9'
2020 6 85 "start_date <= '2020-06-01' and end_date >= '2020-06-01'" 'row2+row3+row4+row5+row8+row9+row10'
2020 7 96 "start_date <= '2020-07-01' and end_date >= '2020-07-01'" 'row2+row3+row4+row5+row8+row9+row10+row11'
2020 8 81 "start_date <= '2020-08-01' and end_date >= '2020-08-01'" 'row2+row3+row4+row5+row9+row10+row11'
... ... ... ... ...
2024 1 11 "start_date <= '2024-01-01' and end_date >= '2024-01-01'" 'row11'
2024 2 11 "start_date <= '2024-02-01' and end_date >= '2024-02-01'" 'row11'
2024 3 0 "start_date <= '2024-03-01' and end_date >= '2024-03-01'" 'no rows'

I am using Hive SQL.


Solution

  • This is exactly what you need, first of all, it should be done cross join to find relation between start and end date between necessary table and after that filtering fulfilled conditions

    
    select 
    example_date,
    sum(traffic)
    
    from sample_table 
    full join dates
    on 1=1
    where example_date >= start_date and example_date <= end_date
    group by example_date
    ORDER by example_date