Tags: google-bigquery, aggregate, analytics

How can I array aggregate per column where distinct values are less than a given number in Google BigQuery?


I have a table like this in Google BigQuery:

| col1 | col2 | col3 | col4 | col5 | col6 |
-------------------------------------------
|  a1  |  b1  |  c1  |  d1  |  e2  |  f1  |
|  a2  |  b2  |  c2  |  d1  |  e2  |  f2  |
|  a1  |  b3  |  c3  |  d1  |  e3  |  f2  |
|  a2  |  b1  |  c4  |  d1  |  e4  |  f2  |
|  a1  |  b2  |  c5  |  d1  |  e5  |  f2  |

Let's say the given threshold is 4; in that case, I want to transform this into one of the tables below:

|     col1     |    col2    |    col4    |    col5     |    col6    |
---------------------------------------------------------------------
|    [a1,a2]   | [b1,b2,b3] |    [d1]    |[e2,e3,e4,e5]|   [f1,f2]  |

Or like this:

| col  | values        |
------------------------
| col1 | [a1,a2]       |
| col2 | [b1,b2,b3]    |
| col4 | [d1]          |
| col5 | [e2,e3,e4,e5] |
| col6 | [f1,f2]       |

Please note that col3 was removed because it contained more than 4 (the threshold) distinct values. I explored a lot of documents here but was not able to figure out the required query. Can somebody help or point me in the right direction?

Edit: I have one solution in mind, where I do something like this:

-- my_table stands in for the actual table
select * from (
  select 'col1' as col, array_agg(distinct col1) as values from my_table union all
  select 'col2', array_agg(distinct col2) from my_table union all
  select 'col3', array_agg(distinct col3) from my_table union all
  select 'col4', array_agg(distinct col4) from my_table union all
  select 'col5', array_agg(distinct col5) from my_table union all
  select 'col6', array_agg(distinct col6) from my_table
) X where array_length(values) <= 4;

This would give me the second result, but it requires complex query construction, since I don't know the number and names of the columns up front. Also, this might cross BigQuery's 100 MB per-row limit, as the table will have more than a billion rows. Please also suggest if there is a better way to do this.
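
For the "unknown columns" part, one possible direction (a sketch, untested; `mydataset.my_table` is a hypothetical placeholder and 4 is the threshold) is to generate the UNION ALL query from INFORMATION_SCHEMA.COLUMNS and run it with EXECUTE IMMEDIATE:

DECLARE gen STRING;

-- Build one SELECT per column; LIMIT 5 (threshold + 1) caps each array
SET gen = (
  SELECT STRING_AGG(
    FORMAT(
      "SELECT '%s' AS col, ARRAY_AGG(DISTINCT CAST(%s AS STRING) IGNORE NULLS LIMIT 5) AS vals FROM `mydataset.my_table`",
      column_name, column_name),
    ' UNION ALL ')
  FROM `mydataset.INFORMATION_SCHEMA.COLUMNS`
  WHERE table_name = 'my_table'
);

-- Keep only columns with at most 4 distinct values
EXECUTE IMMEDIATE FORMAT(
  "SELECT * FROM (%s) WHERE ARRAY_LENGTH(vals) <= 4", gen);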


Solution

  • How about:

    WITH arrays AS (
      SELECT * FROM UNNEST((
        SELECT [
          STRUCT("col_repo_name" AS col, ARRAY_AGG(DISTINCT repo.name IGNORE NULLS LIMIT 1001) AS values)
          , ('col_actor_login', ARRAY_AGG(DISTINCT actor.login IGNORE NULLS LIMIT 1001))
          , ('col_type', ARRAY_AGG(DISTINCT type IGNORE NULLS LIMIT 1001))
          , ('col_org_login', ARRAY_AGG(DISTINCT org.login IGNORE NULLS LIMIT 1001))
          ]
        FROM `githubarchive.year.2017` 
      ))
    )
    
    SELECT *
    FROM arrays
    WHERE ARRAY_LENGTH(values)<=1000
    

    This query processed 20.6 GB in 11.9 s (half a billion rows). It returned only one row, because every other row had more than 1,000 unique values (my threshold).
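
    Adapted to the question's schema, a minimal sketch (assuming the table lives at a hypothetical `mydataset.my_table` and the threshold is 4, so LIMIT 5 caps each array at threshold + 1):

    WITH arrays AS (
      SELECT * FROM UNNEST((
        SELECT [
          STRUCT('col1' AS col, ARRAY_AGG(DISTINCT col1 IGNORE NULLS LIMIT 5) AS values)
          , ('col2', ARRAY_AGG(DISTINCT col2 IGNORE NULLS LIMIT 5))
          , ('col3', ARRAY_AGG(DISTINCT col3 IGNORE NULLS LIMIT 5))
          , ('col4', ARRAY_AGG(DISTINCT col4 IGNORE NULLS LIMIT 5))
          , ('col5', ARRAY_AGG(DISTINCT col5 IGNORE NULLS LIMIT 5))
          , ('col6', ARRAY_AGG(DISTINCT col6 IGNORE NULLS LIMIT 5))
        ]
        FROM `mydataset.my_table`
      ))
    )

    SELECT *
    FROM arrays
    WHERE ARRAY_LENGTH(values) <= 4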


    That's traditional SQL, but here is an even simpler query that produces similar results:

    SELECT col, ARRAY_AGG(DISTINCT value IGNORE NULLS LIMIT 1001) values
    FROM (
      SELECT REGEXP_EXTRACT(x, r'"([^\"]*)"') col , REGEXP_EXTRACT(x, r'":"([^\"]*)"') value
      FROM (
        SELECT SPLIT(TO_JSON_STRING(STRUCT(repo.name, actor.login, type, org.login)), ',') x
        FROM `githubarchive.year.2017`
      ), UNNEST(x) x
    )
    GROUP BY col
    HAVING ARRAY_LENGTH(values)<=1000
    
    # 17.0 sec elapsed, 20.6 GB processed
    

    Caveat: this will only work if there are no special characters in the column values, like quotes or commas. If you have those, it won't be as straightforward (but still possible); one alternative is sketched below.
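
    If your values may contain quotes or commas, a sketch of one alternative (not from the original answer) that avoids string parsing altogether is BigQuery's UNPIVOT operator, assuming the question's col1..col6 schema at a hypothetical `mydataset.my_table`, all STRING columns, and threshold 4:

    SELECT col, ARRAY_AGG(DISTINCT value IGNORE NULLS LIMIT 5) values  -- LIMIT 5 = threshold + 1
    FROM `mydataset.my_table`
    UNPIVOT (value FOR col IN (col1, col2, col3, col4, col5, col6))
    GROUP BY col
    HAVING ARRAY_LENGTH(values) <= 4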