Search code examples
sqlamazon-athenapresto

Athena array aggregate and filter multiple columns on condition


I have data as shown below.

uuid movie data
1 movie1 {title=rental, label=GA, price=50, feetype=rental, hidden=false}
1 movie1 {title=tax, label=GA, price=25, feetype=service-fees, hidden=true}
1 movie1 {title=rental, label=GA, price=50, feetype=rental, hidden=false}
1 movie1 {title=tax, label=GA, price=25, feetype=service-fees, hidden=true}
2 movie3 {title=rental, label=VIP, price=100, feetype=rental, hidden=false}
2 movie3 {title=tax, label=VIP, price=25, feetype=service-fees, hidden=true}
2 movie3 {title=promo, label=VIP, price=10, feetype=discount, hidden=false}

and, this is how i want the result to be like.

uuid total_fee total_discount discount_type
1 150 0 NA
2 125 10 promo

I tried using

SELECT uuid
   , sum("fee"."price") "total_fee"   
   , array_agg(distinct("fee"."feetype")) "fee_type"
   , array_agg(distinct("fee"."title")) "fee_name"

This gives the result as shown below,

uuid total_fee fee_type fee_name
1 100 [rental] [rental]
1 50 [service-fees] [tax]
2 100 [rental] [rental]
2 25 [service-fees] [tax]
2 10 [discount] [promo]

Now how do I aggregate on total_fee and filter fee_name based on fee_type?

I tried using

, CASE WHEN regexp_like(array_join(fee_type, ','), 'discount') THEN sum("fee") ELSE 0  END "discount"

but that resulted in

SYNTAX_ERROR: line 207:6: '(CASE WHEN "regexp_like"("array_join"(fee_type, ','), 'discount') THEN "sum"("fee") ELSE 0 END)' must be an aggregate expression or appear in GROUP BY clause

Solution

  • You should be able to do something like this:

    SELECT
      uuid,
      SUM(fee.price) AS total_fee,
      SUM(fee.price) FILTER (WHERE fee.feetype = 'discount') AS total_discount,
      ARBITRARY(fee.title) FILTER (WHERE fee.feetype = 'discount') AS discount_type
    FROM …
    GROUP BY uuid
    

    (I'm assuming the data column in your example is the same as the fee column in your query).

    Aggregate functions support a FILTER clause that selects the rows to include into the aggregation. This can also be achieved by e.g. SUM(IF(fee.feetype = 'discount', fee.price, 0)), which is more compact but not as elegant.

    The ARBITRARY aggregate function picks an arbitrary value from the group. I don't know if that's appropriate in your case, but I assume that there will only be one discount row per group. If there are more than one you might want to use ARRAY_AGG with the DISTINCT clause (e.g. ARRAY_AGG(DISTINCT fee.title) to get the all).