Search code examples
sqlpandasjoinprestotrino

How can I do this multi-index sum/join in SQL instead of within Python?


I have a query that outputs:

  1. location number: lct_nbr
  2. fiscal week end date: fsc_wk_end_dt
  3. item number: itm_nbr
  4. product number: pg_nbr
  5. SUM(quantity): qty
  6. localtimestamp: refresh_date

and I am trying to determine item penetration, or weight.

For each [lct_nbr, fsc_wk_end_dt, pg_nbr] I want to compute the sum of all qty's to get the total per "product group", and then divide the qty for each itm_nbr in that group by the sum. I have attempted to do this in Python via:

import pandas as pd
import math

# lct fsc_wk_end_dt pg_nbr itm_nbr qty

data = [
   [ 1, '2023-08-01', 1, 1, 5 ],
   [ 1, '2023-08-01', 1, 2, 6 ],
   [ 1, '2023-08-01', 1, 3, 7 ],
   [ 1, '2023-08-01', 2, 1, 2 ],
   [ 1, '2023-08-01', 2, 2, 3 ],
   [ 1, '2023-08-01', 2, 3, 4 ],
   [ 1, '2023-09-01', 1, 1, 1 ],
   [ 1, '2023-09-01', 1, 2, 2 ],
   [ 1, '2023-09-01', 1, 3, 3 ],
   [ 1, '2023-09-01', 2, 1, 9 ],
   [ 1, '2023-09-01', 2, 2, 8 ],
   [ 1, '2023-09-01', 2, 3, 7 ],
   [ 2, '2023-08-01', 1, 1, 3 ],
   [ 2, '2023-08-01', 1, 2, 6 ],
   [ 2, '2023-08-01', 1, 3, 9 ],
   [ 2, '2023-08-01', 2, 1, 7 ],
   [ 2, '2023-08-01', 2, 2, 4 ],
   [ 2, '2023-08-01', 2, 3, 1 ],
   [ 2, '2023-09-01', 1, 1, 2 ],
   [ 2, '2023-09-01', 1, 2, 4 ],
   [ 2, '2023-09-01', 1, 3, 2 ],
   [ 2, '2023-09-01', 2, 1, 9 ],
   [ 2, '2023-09-01', 2, 2, 8 ],
   [ 2, '2023-09-01', 2, 3, 5 ]
]

item_penetration_by_units = pd.DataFrame( data, columns = ['lct','fsc_wk_end_dt','pg_nbr','item_nbr','qty'] )
print(item_penetration_by_units)

item_penetration_by_units.set_index(['lct','fsc_wk_end_dt','pg_nbr'],inplace=True)
x1 = item_penetration_by_units.groupby(['lct','fsc_wk_end_dt','pg_nbr']).sum()
item_penetration_by_units['total'] = item_penetration_by_units.join( x1, how='left', rsuffix='x')['qtyx']
item_penetration_by_units['weight'] = item_penetration_by_units['qty']/item_penetration_by_units['total']
item_penetration_by_units.reset_index(inplace=True)
print(item_penetration_by_units)

How could I so this directly with SQL in Trino instead of within this Python script? Would it be faster, or not?


Solution

  • The analog in Presto/Trino would be using window functions:

    -- sample data
    with dataset(lct, fsc_wk_end_dt, pg_nbr, itm_nbr, qty) as (
        values (1, '2023-08-01', 1, 1, 5),
            (1, '2023-08-01', 1, 2, 6),
            (1, '2023-08-01', 1, 3, 7),
            (1, '2023-08-01', 2, 1, 2),
            (1, '2023-08-01', 2, 2, 3),
            (1, '2023-08-01', 2, 3, 4),
            (1, '2023-09-01', 1, 1, 1),
            (1, '2023-09-01', 1, 2, 2),
            (1, '2023-09-01', 1, 3, 3),
            (1, '2023-09-01', 2, 1, 9),
            (1, '2023-09-01', 2, 2, 8),
            (1, '2023-09-01', 2, 3, 7),
            (2, '2023-08-01', 1, 1, 3),
            (2, '2023-08-01', 1, 2, 6),
            (2, '2023-08-01', 1, 3, 9),
            (2, '2023-08-01', 2, 1, 7),
            (2, '2023-08-01', 2, 2, 4),
            (2, '2023-08-01', 2, 3, 1),
            (2, '2023-09-01', 1, 1, 2),
            (2, '2023-09-01', 1, 2, 4),
            (2, '2023-09-01', 1, 3, 2),
            (2, '2023-09-01', 2, 1, 9),
            (2, '2023-09-01', 2, 2, 8),
            (2, '2023-09-01', 2, 3, 5)
    )
    -- query
    select *,
        qty * 1.000000/total weight
    from (
        select *,
            sum(qty) over (partition by lct, fsc_wk_end_dt, pg_nbr ROWS BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) total
        from dataset)
    order by lct, fsc_wk_end_dt, pg_nbr, itm_nbr;
    

    Would it be faster, or not?

    It can depend on what you want to do with it. You will need to measure it using your actual setups and data.