Search code examples
sqlselecthiveapache-spark-sql

SQL/Hive Query: Calculate the profit information using sales data and multiple cost entries for a day from costs table


I have below sales data and purchase data and using Hive Rank Over partition, tried to calculate the profit but it's not working.

Note: For same SKU Cost information can be updated multiple times in a day.

For order 1011, there is not entry in the cost table so it should consider the cost which was entered on 2022-05-19 06:50:20.000 as 32.5.

Below is the sample sales data:

Sales Data

Below is the sample cost information:

purchase_cost information


Solution

  • I was able to figure it and implement using window functions, below is the complete query:

    SELECT sku, sum(s.unit_sale_price*s.quantity_sold) as total_sale, sum(s.quantity_sold) total_quantity_sold, sum(s.quantity_sold * e.considered_cost) total_cost, sum((s.unit_sale_price * s.quantity_sold)-(s.quantity_sold * e.considered_cost)) profit FROM sales s inner join ( SELECT sku, lead(update_timestamp,1) over ( partition by sku order by update_timestamp desc ) from_timestamp, update_timestamp to_timestamp, lead(cost,1) over ( partition by sku order by update_timestamp desc ) considered_cost, cost as original_cost, rank() OVER (PARTITION BY sku ORDER BY update_timestamp desc) as rk FROM cost ) e on e.sku = s.sku and s.posted_date between e.from_timestamp and e.to_timestamp group by s.sku

    union all

    SELECT sku, sum(s.unit_sale_prices.quantity_sold) as total_sale, sum(s.quantity_sold) total_quantity_sold, sum(s.quantity_sold * e.original_cost) total_cost, sum((s.unit_sale_prices.quantity_sold)-(s.quantity_sold * e.original_cost)) profit FROM sales s inner join ( SELECT sku, lead(update_timestamp,1) over ( partition by sku order by update_timestamp desc ) from_timestamp, update_timestamp to_timestamp, lead(cost,1) over ( partition by sku order by update_timestamp desc ) considered_cost, cost as original_cost, rank() OVER (PARTITION BY sku ORDER BY update_timestamp desc) as rk FROM cost ) e on e.sku = s.sku and s.posted_date > e.to_timestamp and rk =1 group by s.sku

    union all

    SELECT sku, sum(s.unit_sale_prices.quantity_sold) as total_sale, sum(s.quantity_sold) total_quantity_sold, sum(s.quantity_sold * e.original_cost) total_cost, sum((s.unit_sale_prices.quantity_sold)-(s.quantity_sold * e.original_cost)) profit FROM sales s inner join ( SELECT sku, lead(update_timestamp,1) over ( partition by sku order by update_timestamp asc ) from_timestamp, update_timestamp to_timestamp, lead(cost,1) over ( partition by sku order by update_timestamp asc ) considered_cost, cost as original_cost, rank() OVER (PARTITION BY sku ORDER BY update_timestamp asc) as rk FROM cost ) e on e.sku = s.sku and s.posted_date < e.to_timestamp and rk =1 group by s.sku

    Explaination:

    First I have transposed the cost table into from_timestamp and to_timstamp like below,

    enter image description here

    Cost transpose table query:

    SELECT sku, lead(update_timestamp,1) over ( partition by sku order by update_timestamp desc ) from_timestamp, update_timestamp to_timestamp, lead(cost,1) over ( partition by sku order by update_timestamp desc ) considered_cost, cost as original_cost, rank() OVER (PARTITION BY sku ORDER BY update_timestamp desc) as rk FROM cost

    sku , from_timestamp, to_timstamp, cosindered_cost, original_cost, rk

    then applied inner join with sales table and build the aggregation logic to calculate the profit.