Search code examples
sqlhivehqlhiveqlwindow-functions

Compare monthly data while retaining daily granularity


I have the below data which contains monthly targets for a set of ids. The targets are for each id, for every month in 2020. The table named targets. The month column indicates the month of year.

+-------+-------+----+--------+
| month | name  | id | target |
+-------+-------+----+--------+
| 1     | Comp1 | 1  | 6000   |
+-------+-------+----+--------+
| 2     | Comp1 | 1  | 6000   |
+-------+-------+----+--------+
| 3     | Comp1 | 1  | 6000   |
+-------+-------+----+--------+
| 1     | Comp2 | 2  | 6000   |
+-------+-------+----+--------+
| 2     | Comp2 | 2  | 6000   |
+-------+-------+----+--------+
| 3     | Comp2 | 2  | 6000   |
+-------+-------+----+--------+
| 1     | Comp3 | 3  | 6000   |
+-------+-------+----+--------+
| 2     | Comp3 | 3  | 6000   |
+-------+-------+----+--------+
| 3     | Comp3 | 3  | 6000   |
+-------+-------+----+--------+
| 1     | Comp4 | 4  | 6000   |
+-------+-------+----+--------+
| 2     | Comp4 | 4  | 6000   |
+-------+-------+----+--------+
| 3     | Comp4 | 4  | 6000   |
+-------+-------+----+--------+

I then have a second table which contains daily data for a set of ids and is updated daily. In my actual dataset, I've got data from 2019-01-01 up until current day.

+------------+-------+----+--------+--------+
| yyyy_mm_dd | name  | id | actual | region |
+------------+-------+----+--------+--------+
| 2019-01-01 | Comp1 | 1  | 1000   | LATAM  |
+------------+-------+----+--------+--------+
| 2019-01-01 | Comp1 | 1  |   0    |  EU    |
+-------------------------------------------+
| 2019-01-02 | Comp1 | 1  | 2000   |  EU    |
+------------+-------+----+--------+--------+
| 2019-01-03 | Comp1 | 1  | 4000   |  EU    |
+------------+-------+----+--------+--------+
| 2019-01-01 | Comp2 | 2  | 1000   |  EU    |
+------------+-------+----+--------+--------+
| 2019-01-02 | Comp2 | 2  | 2000   |  EU    |
+------------+-------+----+--------+--------+
| 2019-01-03 | Comp2 | 2  | 3000   |  EU    |
+------------+-------+----+--------+--------+
| 2019-01-01 | Comp3 | 3  | 1000   |  EU    |
+------------+-------+----+--------+--------+
| 2019-01-02 | Comp3 | 3  | 2000   |  EU    |
+------------+-------+----+--------+--------+
| 2019-01-03 | Comp3 | 3  | 8000   |  EU    |
+------------+-------+----+--------+--------+
| 2019-01-01 | Comp4 | 4  | 1000   |  EU    |
+------------+-------+----+--------+--------+
| 2019-01-02 | Comp4 | 4  | 2000   |  EU    |
+------------+-------+----+--------+--------+
| 2019-02-03 | Comp4 | 4  | 3000   |  EU    |
+------------+-------+----+--------+--------+

Based on the above two tables, I want to create a third table with some additional logic. Ultimately, I want to have a new column introduced called payment. This column should always be 0 unless the company has passed their monthly target. If the monthly target is met/passed, the payout should then be sum actual for that month - monthly target for that month * 1%.

Here is how the output data might look:

+------------+-------+----+--------+--------+
| yyyy_mm_dd | name  | id | actual | payout |
+------------+-------+----+--------+--------+
| 2020-01-01 | Comp1 | 1  | 1000   | 0      |
+------------+-------+----+--------+--------+
| 2020-01-02 | Comp1 | 1  | 2000   | 0      |
+------------+-------+----+--------+--------+
| 2020-01-03 | Comp1 | 1  | 4000   | 10     |
+------------+-------+----+--------+--------+
| 2020-01-01 | Comp2 | 2  | 1000   | 0      |
+------------+-------+----+--------+--------+
| 2020-01-02 | Comp2 | 2  | 2000   | 0      |
+------------+-------+----+--------+--------+
| 2020-01-03 | Comp2 | 2  | 3000   | 0      |
+------------+-------+----+--------+--------+
| 2020-01-01 | Comp3 | 3  | 1000   | 0      |
+------------+-------+----+--------+--------+
| 2020-01-02 | Comp3 | 3  | 2000   | 0      |
+------------+-------+----+--------+--------+
| 2020-01-03 | Comp3 | 3  | 8000   | 50     |
+------------+-------+----+--------+--------+
| 2020-01-01 | Comp4 | 4  | 1000   | 0      |
+------------+-------+----+--------+--------+
| 2020-01-02 | Comp4 | 4  | 2000   | 0      |
+------------+-------+----+--------+--------+
| 2020-02-03 | Comp4 | 4  | 3000   | 0      |
+------------+-------+----+--------+--------+

All names/ids in the above dataset have a monthly target of 6000. So there should only be a payout when a name/id passes that target during the month. Comp1 and Comp3 both passed the monthly target on the third day of January so they get a payout from that day onwards until the end of the month. This then resets in February as it's a new month with a new target and we'll get new daily data as the month progresses.


What I've tried:

SELECT
    agg.yyyy_mm_dd,
    agg.name,
    agg.id,
    CASE WHEN agg.actual >= targets.target THEN ((agg.actual-targets.target)/100) * 1 ELSE 0 END AS payout
FROM(
    SELECT
        sum(x.actual) AS actual,
        x.yyyy_mm_dd,
        x.name,
        x.id
    FROM(
        SELECT
            yyyy_mm_dd,
            name,
            id,
            cast(actual as int) as actual
        FROM
            schema.daily_data
        WHERE
            yyyy_mm_dd >= '2020-01-01' AND (name = 'Comp1' OR name = 'Comp2')
    ) x
    GROUP BY
        2,3,4
) agg
INNER JOIN(
    SELECT
      id,
      month,
      target
    FROM
        schema.targets
) targets ON targets.id = agg.id
GROUP BY
    1,2,3,4

However, the above outputs multiple rows per name. This is a result of the daily table having the same company multiple times per day (expected). I thought my grouping would have handled that. Also, I don't think it's the most simple solution overall and I'm probably overthinking it / could be done more efficiently.


Solution

  • I think I've now got a working solution. The below is giving the output as expected. It could probably be optimized a bit as it isn't the quickest.

    SELECT
        x.yyyy_mm_dd,
        x.id,
        x.name,
        x.actual,
        x.target,
        x.actual_to_date,
        CASE WHEN x.actual_to_date > x.target THEN ((x.actual_to_date - x.target) /100) * 1 ELSE 0 END AS payout
    FROM(
        SELECT
            daily.yyyy_mm_dd,
            daily.id,
            daily.name,
            daily.actual,
            t.target,
            SUM(daily.actual) OVER (PARTITION BY MONTH(daily.yyyy_mm_dd), daily.id ORDER BY daily.yyyy_mm_dd RANGE UNBOUNDED PRECEDING) AS actual_to_date
        FROM(
            SELECT
                yyyy_mm_dd,
                id,
                name,
                sum(cast(actual as int)) as actual
            FROM
                daily_data_table
            WHERE
                yyyy_mm_dd >= '2020-01-01'
            GROUP BY
                1,2,3
        ) daily
        INNER JOIN
            monthly_target_table t
            ON t.id = daily.id AND t.month = month(daily.yyyy_mm_dd)
        WHERE
            daily.name = 'Comp1'
    ) x