Search code examples
arraysgoogle-bigqueryunnest

Unnest multiple arrays in Bigquery and aggregate again


I'm trying to unnest multiple nested arrays in Bigquery, filter on them and put the new arrays back together. My problem is, that I'm ending up with duplicated values.

Example data:

Example of nested rows

The image has two example rows of data with an array "vendor" that contains the two arrays "topic" and "categories"

I want to filter on vendor.topic.score >= 0.8, vendor.categories.score >= 0.8 and also get rid of the vendor.topic.position column.

The result should look like this:

Result example

First I've tried to solve it with multiple unnest for each array, but this gives me duplicated values in the newly created arrays:

SELECT
  id,
  ARRAY_AGG(STRUCT(vendor_topics.label AS topics_label,
      vendor_topics.score AS topics_score)),
  ARRAY_AGG(STRUCT(vendor_categories.label AS category_label,
      vendor_categories.score AS category_score))
FROM
  `source_table`,
  UNNEST(vendor.topics) vendor_topics,
  UNNEST(vendor.categories) vendor_categories
WHERE
  AND vendor_categories.score >= 0.8
  AND vendor_topics.score >= 0.8
GROUP BY
  1
LIMIT
  10

Next I've tried using subqueries, which ended up in "API limit exceeded: Unable to return a row that exceeds the API limits. To retrieve the row, export the table."

SELECT
  id,
  (
  SELECT
    ARRAY_AGG(STRUCT(vendor_topics.label AS topics_label,
        vendor_topics.score AS topics_score))
  FROM
    `source_table` articles,
    UNNEST(vendor.topics) vendor_topics
    WHERE
  vendor_topics.score >= 0.8),

    (
  SELECT
    ARRAY_AGG(STRUCT(vendor_categories.label AS category_label,
      vendor_categories.score AS category_score))
  FROM
    `source_table`,
    UNNEST(vendor.categories) vendor_categories
    WHERE
  vendor_categories.score >= 0.8)
FROM
  `source_table`
GROUP BY
  1

Now I'm out of ideas and hoping someone can help me solve this please.


Solution

  • I build out your sample data as well in two ways as not sure if vendor is array or not an array. From which you may have the complications.

    1st example vendor is array

    #standardSQL
    WITH `yourTable` AS (
      select 111 as id, (select 
              array(select 
                       struct(array(select struct('A' as label, 0.1 as score,2 as position) 
                             union all select struct('B' as label, 0.9 as score,5 as position)
                             union all select struct('C' as label, 0.9 as score,8 as position)
                                  ) as topic,
                                  array(select struct('Cat1' as label, 0.8 as score) 
                             union all select struct('Cat2' as label, 0.3 as score)
                                  ) as categories 
                              )
                      )) as vendor
        union all 
        select 222 as id, (select 
              array(select 
                       struct(array(select struct('X' as label, 0.3 as score,2 as position) 
                             union all select struct('Y' as label, 0.9 as score,3 as position)
                                  ) as topic,
                                  array(select struct('Cat33' as label, 0.9 as score) 
                             union all select struct('Cat99' as label, 0.4 as score)
                             union all select struct('Cat44' as label, 0.85 as score)
                                  ) as categories 
                              )
                      )) as vendor
    )
     ------
      SELECT id,array(
      select struct(
                  (select array_agg(t) as topic from unnest(vendor),unnest(topic) t where t.score>=0.8) as topic,
                  (select array_agg(t) as categories from unnest(vendor),unnest(categories) t where t.score>=0.8) as categories
                )
       ) as vendor2   from yourTable
    

    This returns:

    enter image description here

    Basically how you need to read this is: - you are selecting a row with id and vendor2 - by nature vendor2 is an array (skip this for second example) - you then need two keys as struct topic and categories - topic or categories are an array of structs.

    2nd example (where vendor is not an array):

    #standardSQL
    WITH `yourTable` AS (
      select 111 as id, (select 
                       struct(array(select struct('A' as label, 0.1 as score,2 as position) 
                             union all select struct('B' as label, 0.9 as score,5 as position)
                             union all select struct('C' as label, 0.9 as score,8 as position)
                                  ) as topic,
                                  array(select struct('Cat1' as label, 0.8 as score) 
                             union all select struct('Cat2' as label, 0.3 as score)
                                  ) as categories 
                              )
                      ) as vendor
        union all 
        select 222 as id, (select 
                       struct(array(select struct('X' as label, 0.3 as score,2 as position) 
                             union all select struct('Y' as label, 0.9 as score,3 as position)
                                  ) as topic,
                                  array(select struct('Cat33' as label, 0.9 as score) 
                             union all select struct('Cat99' as label, 0.4 as score)
                             union all select struct('Cat44' as label, 0.85 as score)
                                  ) as categories 
                              )
                      ) as vendor
    )
    
    
     SELECT id,struct(
                  (select array_agg(t) as topic from unnest(vendor.topic) t where t.score>=0.8) as topic,
                  (select array_agg(t) as categories from unnest(vendor.categories) t where t.score>=0.8) as categories
       ) as vendor2   from yourTable