Search code examples
google-cloud-platformgoogle-bigquerygoogle-analyticsuniversal

How can one unnest an array in BigQuery without using unnest?


I am trying to select hits.page.searchKeyword in a query, but I get the classical nested error. Do I have another option rather than unnest?

When I unnest data from hits (GA3 - UA) all my other selects duplicate their value, and it's a mess to fix. I just need to add the search term to my query, which I have identified in the scheme as hits.page.searchKeyword, and I am done. But I need to unnest the hits array without unnesting function.

SELECT 
  parse_date('%Y%m%d', date) AS date,
  channelGrouping,trafficSource.campaign, 
  hits.page.searchKeyword as searchterm,
  SUM(totals.visits) AS sessions, 
  count(distinct clientId),
  SUM(totals.transactions) AS transactions,
  SUM(totals.totalTransactionRevenue) AS revenue1mil
FROM `vdxl-prod-data-reporting-01.200759185.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20231102'
GROUP BY 1, channelGrouping, trafficSource.campaign, hits.page.searchKeyword
ORDER BY 1 DESC

Error:

Cannot access field page on a value with type
ARRAY<STRUCT<hitNumber INT64, time INT64, hour INT64, ...>> at [4:8]

Solution

  • UNNEST() simply turns an array into a relation. It doesn't automatically join that relation with the main table. That only happens if you do something like FROM `vdxl-prod-data-reporting-01.200759185.ga_sessions_*` t cross join t.hits (sometimes the cross join is replaced with its alias ,)

    hits.page.searchKeyword is the internal search and means there can be multiple of these per session - how would you like to aggregate them to session scope?

    One way could be to get a distinct list sorted by these keywords:

    SELECT 
      parse_date('%Y%m%d', date) AS date,
      channelGrouping,
      trafficSource.campaign, 
      (select string_agg(distinct page.searchKeyword order by page.searchKeyword) from unnest(hits)) as searchterm,
      SUM(totals.visits) AS sessions, 
      count(distinct clientId),
      SUM(totals.transactions) AS transactions,
      SUM(totals.totalTransactionRevenue) AS revenue1mil
    FROM `vdxl-prod-data-reporting-01.200759185.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20231102'
    GROUP BY 1, 2, 3, 4
    ORDER BY 1 DESC
    

    If you want one keyword per row you need the cross join - and of course it repeats session scoped rows for every hit in that row ... that's what joins do. You need to calculate your metrics on hit scope then and can't use session aggregates/totals anymore:

    SELECT 
      parse_date('%Y%m%d', date) AS date,
      channelGrouping,
      trafficSource.campaign, 
      h.page.searchKeyword,
      count(distinct fullvisitorid || visitstarttime) AS sessions, 
      count(distinct clientId) as clients,
      countif(h.ecommerceaction.action_type='6') AS transactions,
      SUM(h.transaction.transactionrevenue /1000000) AS revenue
    FROM `vdxl-prod-data-reporting-01.200759185.ga_sessions_*` as t
      CROSS JOIN t.hits as h
    WHERE 
      _TABLE_SUFFIX BETWEEN '20230101' AND '20231102'
    GROUP BY 
      1, 2, 3, 4
    ORDER BY 
      1 DESC, 2, 3, 4