Search code examples
sqlarraysgoogle-bigquery

Combining and aggregating key/value pairs across different events/rows


I have a table consisting of customer events. These events have arrays containing keys and values.

The following simulated table illustrates what these events could look like:

user event_date event_timestamp store city purchases.key purchases.int_value purchases.float_value purchases.string_value
XXX 2025-01-01 1735693200 store_1 london type online
ware banana
amount 2
price 0.5
XXX 2025-01-01 1735693201 store_1 london type online
ware apple
amount 3
price 0.6
XXX 2025-01-01 1735693202 store_1 london type online
ware banana
amount 3
price 0.5
XXX 2025-01-02 1735693203 store_1 london type online
ware bread
amount 1
price 1.0
YYY 2025-01-01 1735693204 store_2 sydney type physical
ware milk
amount 1
price 1.5
WITH
  CTE_TEST_DATA AS (
    SELECT
      'XXX'     AS user
      ,'2025-01-01' AS event_date
      ,'store_1'    AS store
      ,'london'     AS city
      ,[
          STRUCT('type' AS key, NULL AS int_value, NULL as float_value, 'online' AS string_value),
          STRUCT('ware', NULL, NULL, 'banana'),
          STRUCT('amount', 2, NULL, NULL),
          STRUCT('price', NULL, 0.5, NULL)
        ] AS purchases
    UNION ALL
    SELECT
      'XXX'     AS user
      ,'2025-01-01' AS event_date
      ,'store_1'    AS store
      ,'london'     AS city
      ,[
          STRUCT('type', NULL, NULL, 'online'),
          STRUCT('ware', NULL, NULL, 'apple'),
          STRUCT('amount', 3, NULL, NULL),
          STRUCT('price', NULL, 0.6, NULL)
        ] AS purchases
    UNION ALL
    SELECT
      'XXX'     AS user
      ,'2025-01-01' AS event_date
      ,'store_1'    AS store
      ,'london'     AS city
      ,[
          STRUCT('type', NULL, NULL, 'online'),
          STRUCT('ware', NULL, NULL, 'banana'),
          STRUCT('amount', 3, NULL, NULL),
          STRUCT('price', NULL, 0.5, NULL)
        ] AS purchases
    UNION ALL
    SELECT
      'XXX'     AS user
      ,'2025-01-02' AS event_date
      ,'store_1'    AS store
      ,'london'     AS city
      ,[
          STRUCT('type', NULL, NULL, 'online'),
          STRUCT('ware', NULL, NULL, 'bread'),
          STRUCT('amount', 1, NULL, NULL),
          STRUCT('price', NULL, 1.0, NULL)
        ] AS purchases
    UNION ALL
    SELECT
      'YYY'     AS user
      ,'2025-01-01' AS event_date
      ,'store_2'    AS store
      ,'sydney'     AS city
      ,[
          STRUCT('type', NULL, NULL, 'physical'),
          STRUCT('ware', NULL, NULL, 'milk'),
          STRUCT('amount', 1, NULL, NULL),
          STRUCT('price', NULL, 1.5, NULL)
        ] AS purchases
    )

The actual Customer Events table is quite big, hence a daily aggregation query has been created to make analysis and reporting less costly.

This daily aggregation query yields aggregated results per user per event_date:

SELECT
   CTD.user
  ,CTD.event_date
  ,STRUCT(COUNT(*) as amount) AS events_per_user
  ,*๐‘œ๐‘กโ„Ž๐‘’๐‘Ÿ ๐‘Ž๐‘”๐‘”๐‘Ÿ๐‘’๐‘”๐‘Ž๐‘ก๐‘’๐‘ *
FROM
  CTE_TEST_DATA  AS CTD
GROUP BY
   CTD.user
  ,CTD.event_date

When new metrics are added to the Customer Events table, the daily aggregation query must be modified to include these.

Because of this setup, I have a few constraints when adding aggregated columns, the main ones being that new metrics must be added exclusively using correlated subqueries, and that Common Table Expressions cannot be used.

The result I am trying to achieve is the following, where the array contains only distinct ware names:

user event_date events_per_user.amount ware.name ware.distinct_orders total.amount
xxx 2025-01-01 3 banana 2 5
apple 1 3
yyy 2025-01-01 1 milk 1 1
xxx 2025-01-02 1 bread 1 1

Unfortunately, the closest I have been to achieve that result, is using the following query:

SELECT
   CTD.user
  ,CTD.event_date
  ,STRUCT(COUNT(*) as amount)   AS events_per_user
  ,ARRAY_AGG((
    SELECT AS STRUCT
       P.string_value           AS name
      ,COUNT(P.string_value)    AS distinct_orders
    FROM
      UNNEST(CTD.purchases) AS P
    WHERE 1=1
      AND P.key = 'ware'
    GROUP BY
      P.string_value
  )) AS ware
  ,ARRAY_AGG((
    SELECT AS STRUCT
      SUM(P.int_value)          AS amount
    FROM
      UNNEST(CTD.purchases) AS P
    WHERE 1=1
      AND P.key = 'amount'
    GROUP BY
       user
      ,event_date
  )) AS total
FROM
  CTE_TEST_DATA AS CTD
GROUP BY
   CTD.user
  ,CTD.event_date

which results in this table:

user event_date events_per_user.amount ware.name ware.distinct_orders total.amount
xxx 2025-01-01 3 banana 1 2
apple 1 3
banana 1 3
yyy 2025-01-01 1 milk 1 1
xxx 2025-01-02 1 bread 1 1

I have spent hours researching and trying different methods to achieve this result, unfortunately without success, and I am left wondering if it is even possible.

Any help would be greatly appreciated!


Solution

  • See example.
    There ware_totalSum as totalSum - new metric added to your example.

      -- aggregate by user,event_date and array of data values
     select user,event_date,sum(events_per_user_ware)as events_per_user
         ,array_agg(
             struct (wareV as ware,ware_distinct_orders as ware_distinct_orders,amount as amount,totalSum as totalSum)
         ) wd -- ware_data
      from ( -- aggregate by user,event_date,wareV
         select user,event_date,wareV
          ,count(*) events_per_user_ware
          ,count(*) ware_distinct_orders
          ,sum(amountV) as amount
          ,sum(priceV*amountV) totalSum
        from (  -- extract data values
          select user,event_date,store,city 
            ,purchases[0].string_value as typeV
            ,purchases[1].string_value as wareV
            ,purchases[2].int_value as amountV
            ,purchases[3].float_value as priceV
          from CTE_TEST_DATA
        )raw_data
        group by user,event_date,wareV
      ) agg_data
      group by user,event_date
    
    Row user event_date events_per_user wd.ware wd.ware_distinct_orders wd.amount wd.totalSum
    1 XXX 2025-01-01 3 banana 2 5 2.5
    apple 1 3 1.7999999999999998
    2 XXX 2025-01-02 1 bread 1 1 1.0
    3 YYY 2025-01-01 1 milk 1 1 1.5

    Update1

    Your restriction on using CTE and on pre-grouping data in general looks unclear. Of course, it is possible to fulfill these requirements. Due to the significant impact on performance.

      select user,event_date
           ,count(*) as events_per_user,sum(purchases[2].int_value) as amount_per_user 
         --, sum(purchases[2].int_value*purchases[3].float_value)as totalSum_per_user
           ,(select 
               array_agg(
                 struct (wareV as ware,ware_distinct_orders as ware_distinct_orders
                          ,amount as amount,totalSum as totalSum) )
             from (
                select user,event_date,purchases[1].string_value as wareV
                  ,count(*) as ware_distinct_orders,sum(purchases[2].int_value) as amount
                  ,sum(purchases[2].int_value*purchases[3].float_value) as totalSum
                from CTE_TEST_DATA 
                group by user,event_date,wareV
               ) rw
             where rw.user=t.user and rw.event_date=t.event_date
             group by user,event_date
            )as wd
      from  CTE_TEST_DATA t 
      group by user,event_date;