I have a table with the following fields: start_date
, end_date
and traffic
. I need to group data by every year and month from 2020-01-01
and sum traffic
field. The tricky moment is that the year and month are not tied to the start_date
and end_date
. For example, in the resulting table for the year=2023 and month=2, the traffic sum calculated like this:
select
2023 as year,
2 as month,
sum(traffic) as traffic
from table
where start_date <= '2023-02-01' and end_date >= '2023-02-01'
start_date
and end_date
values are between '1998-01-01' and now
With union all
the required script looks like this:
select
2020 as year,
1 as month,
sum(traffic) as traffic
from table
where start_date <= '2020-01-01' and end_date >= '2020-01-01'
union all
select
2020 as year,
2 as month,
sum(traffic) as traffic
from table
where start_date <= '2020-02-01' and end_date >= '2020-02-01'
union all
select
2020 as year,
3 as month,
sum(traffic) as traffic
from table
where start_date <= '2020-03-01' and end_date >= '2020-03-01'
union all
...
...
...
select
2024 as year,
3 as month,
sum(traffic) as traffic
from table
where start_date <= '2024-03-01' and end_date >= '2024-03-01'
How to get rid of union all
and make the script more compact? I tried to create a separate table selecting all dates from 1998, then calculating year, month, and first date of the month for every date. Example:
date | month | year | first_date |
---|---|---|---|
1998-01-01 | 1 | 1998 | 1998-01-01 |
1998-01-02 | 1 | 1998 | 1998-01-01 |
1998-01-03 | 1 | 1998 | 1998-01-01 |
... | ... | ... | ... |
Then I joined this table with the source table and got this script:
select
d.year,
d.month,
sum(traffic) as traffic
from table as t
join generated_dated as d on d.date = t.start_date
where start_date <= d.first_date and end_date >= d.first_date
group by
d.year,
d.month
But the results were wrong.
Data sample:
row | start_date | end_date | traffic |
---|---|---|---|
1 | 2018-09-15 | 2019-02-13 | 11 |
2 | 2019-02-03 | 2021-03-05 | 12 |
3 | 2019-09-03 | 2022-03-05 | 2 |
4 | 2020-01-15 | 2023-04-02 | 4 |
5 | 2020-01-30 | 2022-10-19 | 4 |
6 | 2020-02-01 | 2020-02-02 | 7 |
7 | 2020-02-13 | 2020-02-19 | 9 |
8 | 2020-02-14 | 2020-07-05 | 15 |
9 | 2020-03-09 | 2022-08-08 | 24 |
10 | 2020-05-12 | 2022-09-01 | 24 |
11 | 2020-06-19 | 2024-02-05 | 11 |
For this data sample results should be:
year | month | traffic sum | *condition | *calculation for traffic sum |
---|---|---|---|---|
2020 | 1 | 14 | "start_date <= '2020-01-01' and end_date >= '2020-01-01'" | 'row2+row3' |
2020 | 2 | 29 | "start_date <= '2020-02-01' and end_date >= '2020-02-01'" | 'row2+row3+row4+row5+row6' |
2020 | 3 | 37 | "start_date <= '2020-03-01' and end_date >= '2020-03-01'" | 'row2+row3+row4+row5+row8' |
2020 | 4 | 61 | "start_date <= '2020-04-01' and end_date >= '2020-04-01'" | 'row2+row3+row4+row5+row8+row9' |
2020 | 5 | 61 | "start_date <= '2020-05-01' and end_date >= '2020-05-01'" | 'row2+row3+row4+row5+row8+row9' |
2020 | 6 | 85 | "start_date <= '2020-06-01' and end_date >= '2020-06-01'" | 'row2+row3+row4+row5+row8+row9+row10' |
2020 | 7 | 96 | "start_date <= '2020-07-01' and end_date >= '2020-07-01'" | 'row2+row3+row4+row5+row8+row9+row10+row11' |
2020 | 8 | 81 | "start_date <= '2020-08-01' and end_date >= '2020-08-01'" | 'row2+row3+row4+row5+row9+row10+row11' |
... | ... | ... | ... | ... |
2024 | 1 | 11 | "start_date <= '2024-01-01' and end_date >= '2024-01-01'" | 'row11' |
2024 | 2 | 11 | "start_date <= '2024-02-01' and end_date >= '2024-02-01'" | 'row11' |
2024 | 3 | 0 | "start_date <= '2024-03-01' and end_date >= '2024-03-01'" | 'no rows' |
I am using Hive SQL
.
This is exactly what you need, first of all, it should be done cross join to find relation between start and end date between necessary table and after that filtering fulfilled conditions
select
example_date,
sum(traffic)
from sample_table
full join dates
on 1=1
where example_date >= start_date and example_date <= end_date
group by example_date
ORDER by example_date