I have a table consisting of customer events. These events have arrays containing keys and values.
The following simulated table illustrates what these events could look like:
user | event_date | event_timestamp | store | city | purchases.key | purchases.int_value | purchases.float_value | purchases.string_value |
---|---|---|---|---|---|---|---|---|
XXX | 2025-01-01 | 1735693200 | store_1 | london | type | online | ||
ware | banana | |||||||
amount | 2 | |||||||
price | 0.5 | |||||||
XXX | 2025-01-01 | 1735693201 | store_1 | london | type | online | ||
ware | apple | |||||||
amount | 3 | |||||||
price | 0.6 | |||||||
XXX | 2025-01-01 | 1735693202 | store_1 | london | type | online | ||
ware | banana | |||||||
amount | 3 | |||||||
price | 0.5 | |||||||
XXX | 2025-01-02 | 1735693203 | store_1 | london | type | online | ||
ware | bread | |||||||
amount | 1 | |||||||
price | 1.0 | |||||||
YYY | 2025-01-01 | 1735693204 | store_2 | sydney | type | physical | ||
ware | milk | |||||||
amount | 1 | |||||||
price | 1.5 |
WITH
CTE_TEST_DATA AS (
SELECT
'XXX' AS user
,'2025-01-01' AS event_date
,'store_1' AS store
,'london' AS city
,[
STRUCT('type' AS key, NULL AS int_value, NULL as float_value, 'online' AS string_value),
STRUCT('ware', NULL, NULL, 'banana'),
STRUCT('amount', 2, NULL, NULL),
STRUCT('price', NULL, 0.5, NULL)
] AS purchases
UNION ALL
SELECT
'XXX' AS user
,'2025-01-01' AS event_date
,'store_1' AS store
,'london' AS city
,[
STRUCT('type', NULL, NULL, 'online'),
STRUCT('ware', NULL, NULL, 'apple'),
STRUCT('amount', 3, NULL, NULL),
STRUCT('price', NULL, 0.6, NULL)
] AS purchases
UNION ALL
SELECT
'XXX' AS user
,'2025-01-01' AS event_date
,'store_1' AS store
,'london' AS city
,[
STRUCT('type', NULL, NULL, 'online'),
STRUCT('ware', NULL, NULL, 'banana'),
STRUCT('amount', 3, NULL, NULL),
STRUCT('price', NULL, 0.5, NULL)
] AS purchases
UNION ALL
SELECT
'XXX' AS user
,'2025-01-02' AS event_date
,'store_1' AS store
,'london' AS city
,[
STRUCT('type', NULL, NULL, 'online'),
STRUCT('ware', NULL, NULL, 'bread'),
STRUCT('amount', 1, NULL, NULL),
STRUCT('price', NULL, 1.0, NULL)
] AS purchases
UNION ALL
SELECT
'YYY' AS user
,'2025-01-01' AS event_date
,'store_2' AS store
,'sydney' AS city
,[
STRUCT('type', NULL, NULL, 'physical'),
STRUCT('ware', NULL, NULL, 'milk'),
STRUCT('amount', 1, NULL, NULL),
STRUCT('price', NULL, 1.5, NULL)
] AS purchases
)
The actual Customer Events table is quite big, hence a daily aggregation query has been created to make analysis and reporting less costly.
This daily aggregation query yields aggregated results per user per event_date:
SELECT
CTD.user
,CTD.event_date
,STRUCT(COUNT(*) as amount) AS events_per_user
,*๐๐กโ๐๐ ๐๐๐๐๐๐๐๐ก๐๐ *
FROM
CTE_TEST_DATA AS CTD
GROUP BY
CTD.user
,CTD.event_date
When new metrics are added to the Customer Events table, the daily aggregation query must be modified to include these.
Because of this setup, I have a few constraints when adding aggregated columns, the main ones being that new metrics must be added exclusively using correlated subqueries, and that Common Table Expressions cannot be used.
The result I am trying to achieve is the following, where the array contains only distinct ware names:
user | event_date | events_per_user.amount | ware.name | ware.distinct_orders | total.amount |
---|---|---|---|---|---|
xxx | 2025-01-01 | 3 | banana | 2 | 5 |
apple | 1 | 3 | |||
yyy | 2025-01-01 | 1 | milk | 1 | 1 |
xxx | 2025-01-02 | 1 | bread | 1 | 1 |
Unfortunately, the closest I have been to achieve that result, is using the following query:
SELECT
CTD.user
,CTD.event_date
,STRUCT(COUNT(*) as amount) AS events_per_user
,ARRAY_AGG((
SELECT AS STRUCT
P.string_value AS name
,COUNT(P.string_value) AS distinct_orders
FROM
UNNEST(CTD.purchases) AS P
WHERE 1=1
AND P.key = 'ware'
GROUP BY
P.string_value
)) AS ware
,ARRAY_AGG((
SELECT AS STRUCT
SUM(P.int_value) AS amount
FROM
UNNEST(CTD.purchases) AS P
WHERE 1=1
AND P.key = 'amount'
GROUP BY
user
,event_date
)) AS total
FROM
CTE_TEST_DATA AS CTD
GROUP BY
CTD.user
,CTD.event_date
which results in this table:
user | event_date | events_per_user.amount | ware.name | ware.distinct_orders | total.amount |
---|---|---|---|---|---|
xxx | 2025-01-01 | 3 | banana | 1 | 2 |
apple | 1 | 3 | |||
banana | 1 | 3 | |||
yyy | 2025-01-01 | 1 | milk | 1 | 1 |
xxx | 2025-01-02 | 1 | bread | 1 | 1 |
I have spent hours researching and trying different methods to achieve this result, unfortunately without success, and I am left wondering if it is even possible.
Any help would be greatly appreciated!
See example.
There ware_totalSum as totalSum
- new metric added to your example.
-- aggregate by user,event_date and array of data values
select user,event_date,sum(events_per_user_ware)as events_per_user
,array_agg(
struct (wareV as ware,ware_distinct_orders as ware_distinct_orders,amount as amount,totalSum as totalSum)
) wd -- ware_data
from ( -- aggregate by user,event_date,wareV
select user,event_date,wareV
,count(*) events_per_user_ware
,count(*) ware_distinct_orders
,sum(amountV) as amount
,sum(priceV*amountV) totalSum
from ( -- extract data values
select user,event_date,store,city
,purchases[0].string_value as typeV
,purchases[1].string_value as wareV
,purchases[2].int_value as amountV
,purchases[3].float_value as priceV
from CTE_TEST_DATA
)raw_data
group by user,event_date,wareV
) agg_data
group by user,event_date
Row | user | event_date | events_per_user | wd.ware | wd.ware_distinct_orders | wd.amount | wd.totalSum |
---|---|---|---|---|---|---|---|
1 | XXX | 2025-01-01 | 3 | banana | 2 | 5 | 2.5 |
apple | 1 | 3 | 1.7999999999999998 | ||||
2 | XXX | 2025-01-02 | 1 | bread | 1 | 1 | 1.0 |
3 | YYY | 2025-01-01 | 1 | milk | 1 | 1 | 1.5 |
Update1
Your restriction on using CTE and on pre-grouping data in general looks unclear. Of course, it is possible to fulfill these requirements. Due to the significant impact on performance.
select user,event_date
,count(*) as events_per_user,sum(purchases[2].int_value) as amount_per_user
--, sum(purchases[2].int_value*purchases[3].float_value)as totalSum_per_user
,(select
array_agg(
struct (wareV as ware,ware_distinct_orders as ware_distinct_orders
,amount as amount,totalSum as totalSum) )
from (
select user,event_date,purchases[1].string_value as wareV
,count(*) as ware_distinct_orders,sum(purchases[2].int_value) as amount
,sum(purchases[2].int_value*purchases[3].float_value) as totalSum
from CTE_TEST_DATA
group by user,event_date,wareV
) rw
where rw.user=t.user and rw.event_date=t.event_date
group by user,event_date
)as wd
from CTE_TEST_DATA t
group by user,event_date;