I have a query that outputs the rows shown in the sample data below, and I am trying to determine item penetration, or weight.
For each [lct, fsc_wk_end_dt, pg_nbr] group I want to compute the sum of all qty values to get a total per "product group", and then divide the qty of each itm_nbr in that group by that total. I have attempted to do this in Python via:
import pandas as pd

# columns: lct, fsc_wk_end_dt, pg_nbr, itm_nbr, qty
data = [
[ 1, '2023-08-01', 1, 1, 5 ],
[ 1, '2023-08-01', 1, 2, 6 ],
[ 1, '2023-08-01', 1, 3, 7 ],
[ 1, '2023-08-01', 2, 1, 2 ],
[ 1, '2023-08-01', 2, 2, 3 ],
[ 1, '2023-08-01', 2, 3, 4 ],
[ 1, '2023-09-01', 1, 1, 1 ],
[ 1, '2023-09-01', 1, 2, 2 ],
[ 1, '2023-09-01', 1, 3, 3 ],
[ 1, '2023-09-01', 2, 1, 9 ],
[ 1, '2023-09-01', 2, 2, 8 ],
[ 1, '2023-09-01', 2, 3, 7 ],
[ 2, '2023-08-01', 1, 1, 3 ],
[ 2, '2023-08-01', 1, 2, 6 ],
[ 2, '2023-08-01', 1, 3, 9 ],
[ 2, '2023-08-01', 2, 1, 7 ],
[ 2, '2023-08-01', 2, 2, 4 ],
[ 2, '2023-08-01', 2, 3, 1 ],
[ 2, '2023-09-01', 1, 1, 2 ],
[ 2, '2023-09-01', 1, 2, 4 ],
[ 2, '2023-09-01', 1, 3, 2 ],
[ 2, '2023-09-01', 2, 1, 9 ],
[ 2, '2023-09-01', 2, 2, 8 ],
[ 2, '2023-09-01', 2, 3, 5 ]
]
item_penetration_by_units = pd.DataFrame(data, columns=['lct', 'fsc_wk_end_dt', 'pg_nbr', 'itm_nbr', 'qty'])
print(item_penetration_by_units)
# index by the grouping keys so the per-group totals can be joined back on
item_penetration_by_units.set_index(['lct', 'fsc_wk_end_dt', 'pg_nbr'], inplace=True)
# total qty per (lct, fsc_wk_end_dt, pg_nbr) product group
x1 = item_penetration_by_units.groupby(['lct', 'fsc_wk_end_dt', 'pg_nbr'])[['qty']].sum()
# attach each group's total to its rows, then divide to get each item's weight
item_penetration_by_units['total'] = item_penetration_by_units.join(x1, how='left', rsuffix='x')['qtyx']
item_penetration_by_units['weight'] = item_penetration_by_units['qty'] / item_penetration_by_units['total']
item_penetration_by_units.reset_index(inplace=True)
print(item_penetration_by_units)
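As an aside, the same weights can be computed more compactly in pandas with a grouped transform, which avoids the set_index/join round trip. A minimal sketch reusing the data list above:

df = pd.DataFrame(data, columns=['lct', 'fsc_wk_end_dt', 'pg_nbr', 'itm_nbr', 'qty'])
# broadcast each group's total back to its rows, then take the per-row share
df['total'] = df.groupby(['lct', 'fsc_wk_end_dt', 'pg_nbr'])['qty'].transform('sum')
df['weight'] = df['qty'] / df['total']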
How could I do this directly with SQL in Trino instead of within this Python script? Would it be faster, or not?
The analog in Presto/Trino would be to use window functions:
-- sample data
with dataset(lct, fsc_wk_end_dt, pg_nbr, itm_nbr, qty) as (
values (1, '2023-08-01', 1, 1, 5),
(1, '2023-08-01', 1, 2, 6),
(1, '2023-08-01', 1, 3, 7),
(1, '2023-08-01', 2, 1, 2),
(1, '2023-08-01', 2, 2, 3),
(1, '2023-08-01', 2, 3, 4),
(1, '2023-09-01', 1, 1, 1),
(1, '2023-09-01', 1, 2, 2),
(1, '2023-09-01', 1, 3, 3),
(1, '2023-09-01', 2, 1, 9),
(1, '2023-09-01', 2, 2, 8),
(1, '2023-09-01', 2, 3, 7),
(2, '2023-08-01', 1, 1, 3),
(2, '2023-08-01', 1, 2, 6),
(2, '2023-08-01', 1, 3, 9),
(2, '2023-08-01', 2, 1, 7),
(2, '2023-08-01', 2, 2, 4),
(2, '2023-08-01', 2, 3, 1),
(2, '2023-09-01', 1, 1, 2),
(2, '2023-09-01', 1, 2, 4),
(2, '2023-09-01', 1, 3, 2),
(2, '2023-09-01', 2, 1, 9),
(2, '2023-09-01', 2, 2, 8),
(2, '2023-09-01', 2, 3, 5)
)
-- query
select *,
       -- multiply by 1.000000 so the division is decimal rather than integer
       qty * 1.000000 / total as weight
from (
    select *,
           -- the frame clause is optional here: with no order by,
           -- the window already spans the whole partition
           sum(qty) over (
               partition by lct, fsc_wk_end_dt, pg_nbr
               rows between unbounded preceding and unbounded following
           ) as total
    from dataset
)
order by lct, fsc_wk_end_dt, pg_nbr, itm_nbr;
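If you still want the result back in Python, the trino client package can execute the same statement so that only the finished rows cross the wire. A minimal sketch; the host, port, user, catalog, schema, and my_table name are placeholders for your setup:

import trino

# placeholder connection details -- substitute your cluster's values
conn = trino.dbapi.connect(
    host='trino.example.com',
    port=8080,
    user='me',
    catalog='hive',
    schema='default',
)
cur = conn.cursor()
# the aggregation runs inside Trino; the client only receives the result rows
cur.execute("""
    select *, qty * 1.000000 / total as weight
    from (
        select *,
               sum(qty) over (partition by lct, fsc_wk_end_dt, pg_nbr) as total
        from my_table
    )
""")
rows = cur.fetchall()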
Would it be faster, or not?
It can depend on what you want to do with the result and how much data is involved; you will need to measure it using your actual setup and data. One general consideration: computing the totals in Trino means only the finished result crosses the wire, whereas the pandas version first has to pull every raw row into client memory.
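To put numbers on it, you can time the pandas computation on a representative slice and compare it against the wall-clock time of the Trino query (timed the same way around execute/fetchall). A rough sketch; the row count and key ranges below are made-up assumptions:

import time
import numpy as np
import pandas as pd

# synthetic data shaped like the real table; sizes are arbitrary assumptions
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    'lct': rng.integers(1, 100, n),
    'fsc_wk_end_dt': rng.choice(['2023-08-01', '2023-09-01'], n),
    'pg_nbr': rng.integers(1, 50, n),
    'itm_nbr': rng.integers(1, 1000, n),
    'qty': rng.integers(1, 10, n),
})

start = time.perf_counter()
df['weight'] = df['qty'] / df.groupby(['lct', 'fsc_wk_end_dt', 'pg_nbr'])['qty'].transform('sum')
print(f"pandas: {time.perf_counter() - start:.3f}s")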