I have a query that outputs the rows shown in the sample data below, and I am trying to determine item penetration, or weight.
For each [lct, fsc_wk_end_dt, pg_nbr] group I want to compute the sum of all qty values to get a total per "product group", and then divide the qty of each itm_nbr in that group by that total. I have attempted to do this in Python via:
import pandas as pd

# columns: lct, fsc_wk_end_dt, pg_nbr, itm_nbr, qty
data = [
[ 1, '2023-08-01', 1, 1, 5 ],
[ 1, '2023-08-01', 1, 2, 6 ],
[ 1, '2023-08-01', 1, 3, 7 ],
[ 1, '2023-08-01', 2, 1, 2 ],
[ 1, '2023-08-01', 2, 2, 3 ],
[ 1, '2023-08-01', 2, 3, 4 ],
[ 1, '2023-09-01', 1, 1, 1 ],
[ 1, '2023-09-01', 1, 2, 2 ],
[ 1, '2023-09-01', 1, 3, 3 ],
[ 1, '2023-09-01', 2, 1, 9 ],
[ 1, '2023-09-01', 2, 2, 8 ],
[ 1, '2023-09-01', 2, 3, 7 ],
[ 2, '2023-08-01', 1, 1, 3 ],
[ 2, '2023-08-01', 1, 2, 6 ],
[ 2, '2023-08-01', 1, 3, 9 ],
[ 2, '2023-08-01', 2, 1, 7 ],
[ 2, '2023-08-01', 2, 2, 4 ],
[ 2, '2023-08-01', 2, 3, 1 ],
[ 2, '2023-09-01', 1, 1, 2 ],
[ 2, '2023-09-01', 1, 2, 4 ],
[ 2, '2023-09-01', 1, 3, 2 ],
[ 2, '2023-09-01', 2, 1, 9 ],
[ 2, '2023-09-01', 2, 2, 8 ],
[ 2, '2023-09-01', 2, 3, 5 ]
]
item_penetration_by_units = pd.DataFrame(data, columns=['lct', 'fsc_wk_end_dt', 'pg_nbr', 'itm_nbr', 'qty'])
print(item_penetration_by_units)
# index by the grouping keys so the per-group totals can be joined back on
item_penetration_by_units.set_index(['lct', 'fsc_wk_end_dt', 'pg_nbr'], inplace=True)
# total qty per (lct, fsc_wk_end_dt, pg_nbr) product group
x1 = item_penetration_by_units.groupby(['lct', 'fsc_wk_end_dt', 'pg_nbr'])[['qty']].sum()
# attach each group's total to its rows, then divide to get each item's weight
item_penetration_by_units['total'] = item_penetration_by_units.join(x1, how='left', rsuffix='x')['qtyx']
item_penetration_by_units['weight'] = item_penetration_by_units['qty'] / item_penetration_by_units['total']
item_penetration_by_units.reset_index(inplace=True)
print(item_penetration_by_units)
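As an aside, the same weights can be computed more compactly in pandas with a grouped transform, which avoids the set_index/join round trip. A minimal sketch reusing the data list above:

df = pd.DataFrame(data, columns=['lct', 'fsc_wk_end_dt', 'pg_nbr', 'itm_nbr', 'qty'])
# broadcast each group's total back to its rows, then take the per-row share
df['total'] = df.groupby(['lct', 'fsc_wk_end_dt', 'pg_nbr'])['qty'].transform('sum')
df['weight'] = df['qty'] / df['total']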
How could I do this directly with SQL in Trino instead of within this Python script? Would it be faster, or not?
The analog in Presto/Trino would be to use window functions:
-- sample data
with dataset(lct, fsc_wk_end_dt, pg_nbr, itm_nbr, qty) as (
values (1, '2023-08-01', 1, 1, 5),
(1, '2023-08-01', 1, 2, 6),
(1, '2023-08-01', 1, 3, 7),
(1, '2023-08-01', 2, 1, 2),
(1, '2023-08-01', 2, 2, 3),
(1, '2023-08-01', 2, 3, 4),
(1, '2023-09-01', 1, 1, 1),
(1, '2023-09-01', 1, 2, 2),
(1, '2023-09-01', 1, 3, 3),
(1, '2023-09-01', 2, 1, 9),
(1, '2023-09-01', 2, 2, 8),
(1, '2023-09-01', 2, 3, 7),
(2, '2023-08-01', 1, 1, 3),
(2, '2023-08-01', 1, 2, 6),
(2, '2023-08-01', 1, 3, 9),
(2, '2023-08-01', 2, 1, 7),
(2, '2023-08-01', 2, 2, 4),
(2, '2023-08-01', 2, 3, 1),
(2, '2023-09-01', 1, 1, 2),
(2, '2023-09-01', 1, 2, 4),
(2, '2023-09-01', 1, 3, 2),
(2, '2023-09-01', 2, 1, 9),
(2, '2023-09-01', 2, 2, 8),
(2, '2023-09-01', 2, 3, 5)
)
-- query
select *,
       -- multiply by 1.000000 so the division is decimal rather than integer
       qty * 1.000000 / total as weight
from (
    select *,
           -- the frame clause is optional here: with no order by,
           -- the window already spans the whole partition
           sum(qty) over (
               partition by lct, fsc_wk_end_dt, pg_nbr
               rows between unbounded preceding and unbounded following
           ) as total
    from dataset
)
order by lct, fsc_wk_end_dt, pg_nbr, itm_nbr;
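If you still want the result back in Python, the trino client package can execute the same statement so that only the finished rows cross the wire. A minimal sketch; the host, port, user, catalog, schema, and my_table name are placeholders for your setup:

import trino

# placeholder connection details -- substitute your cluster's values
conn = trino.dbapi.connect(
    host='trino.example.com',
    port=8080,
    user='me',
    catalog='hive',
    schema='default',
)
cur = conn.cursor()
# the aggregation runs inside Trino; the client only receives the result rows
cur.execute("""
    select *, qty * 1.000000 / total as weight
    from (
        select *,
               sum(qty) over (partition by lct, fsc_wk_end_dt, pg_nbr) as total
        from my_table
    )
""")
rows = cur.fetchall()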
Would it be faster, or not?
It can depend on what you want to do with the result and how much data is involved; you will need to measure it using your actual setup and data. One general consideration: computing the totals in Trino means only the finished result crosses the wire, whereas the pandas version first has to pull every raw row into client memory.
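To put numbers on it, you can time the pandas computation on a representative slice and compare it against the wall-clock time of the Trino query (timed the same way around execute/fetchall). A rough sketch; the row count and key ranges below are made-up assumptions:

import time
import numpy as np
import pandas as pd

# synthetic data shaped like the real table; sizes are arbitrary assumptions
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    'lct': rng.integers(1, 100, n),
    'fsc_wk_end_dt': rng.choice(['2023-08-01', '2023-09-01'], n),
    'pg_nbr': rng.integers(1, 50, n),
    'itm_nbr': rng.integers(1, 1000, n),
    'qty': rng.integers(1, 10, n),
})

start = time.perf_counter()
df['weight'] = df['qty'] / df.groupby(['lct', 'fsc_wk_end_dt', 'pg_nbr'])['qty'].transform('sum')
print(f"pandas: {time.perf_counter() - start:.3f}s")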