I have a dataset table like this in Google Big Query:
| col1 | col2 | col3 | col4 | col5 | col6 |
-------------------------------------------
| a1 | b1 | c1 | d1 | e2 | f1 |
| a2 | b2 | c2 | d1 | e2 | f2 |
| a1 | b3 | c3 | d1 | e3 | f2 |
| a2 | b1 | c4 | d1 | e4 | f2 |
| a1 | b2 | c5 | d1 | e5 | f2 |
Let's say the given threshold number is 4, in that case, I want to transform this into one of the tables given below:
| col1 | col2 | col4 | col5 | col6 |
---------------------------------------------------------------------
| [a1,a2] | [b1,b2,b3] | [d1] | [e2,e3,e4,e5] | [f1,f2] |
Or like this:
| col | values |
------------------------
| col1 | [a1,a2] |
| col2 | [b1,b2,b3] |
| col4 | [d1] |
| col5 | [e2,e3,e4,e5] |
| col6 | [f1,f2] |
Please note that col3 was removed because it contained more than 4 (the threshold) distinct values. I explored a lot of documents here but was not able to figure out the required query. Can somebody help or point me in the right direction?
Edit: I have one solution in mind, where I do something like this:
select * from (
  -- mytable is a placeholder for the actual source table
  select 'col1' as col, array_agg(distinct col1) as values from mytable union all
  select 'col2', array_agg(distinct col2) from mytable union all
  select 'col3', array_agg(distinct col3) from mytable union all
  select 'col4', array_agg(distinct col4) from mytable union all
  select 'col5', array_agg(distinct col5) from mytable union all
  select 'col6', array_agg(distinct col6) from mytable
) X where array_length(values) <= 4;
This will give me the second result, but it requires complex query construction, assuming I don't know the number and names of the columns up front. Also, this might cross BigQuery's 100 MB per-row limit, as I will have more than a billion rows in the table. Please also suggest if there is a better way to do this.
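For reference, a sketch of how that per-column UNION ALL could be generated dynamically with BigQuery scripting, so the column names don't need to be known up front. The dataset and table names (mydataset.mytable) are placeholders, the threshold is hard-coded to 4, and it assumes the columns are scalar and castable to STRING:

-- Sketch only: build one SELECT per column from INFORMATION_SCHEMA, then run it.
-- mydataset.mytable is a placeholder; LIMIT 5 = threshold + 1, so a column that
-- reaches 5 distinct values is known to exceed the threshold.
DECLARE per_column_query STRING;

SET per_column_query = (
  SELECT STRING_AGG(
    FORMAT(
      "SELECT '%s' AS col, ARRAY_AGG(DISTINCT CAST(%s AS STRING) IGNORE NULLS LIMIT 5) AS vals FROM mydataset.mytable",
      column_name, column_name),
    ' UNION ALL ')
  FROM mydataset.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name = 'mytable'
);

EXECUTE IMMEDIATE FORMAT(
  "SELECT col, vals FROM (%s) WHERE ARRAY_LENGTH(vals) <= 4",
  per_column_query);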
How about:
WITH arrays AS (
  SELECT * FROM UNNEST((
    SELECT [
      STRUCT("col_repo_name" AS col, ARRAY_AGG(DISTINCT repo.name IGNORE NULLS LIMIT 1001) AS values)
      , ('col_actor_login', ARRAY_AGG(DISTINCT actor.login IGNORE NULLS LIMIT 1001))
      , ('col_type', ARRAY_AGG(DISTINCT type IGNORE NULLS LIMIT 1001))
      , ('col_org_login', ARRAY_AGG(DISTINCT org.login IGNORE NULLS LIMIT 1001))
    ]
    FROM `githubarchive.year.2017`
  ))
)
SELECT *
FROM arrays
WHERE ARRAY_LENGTH(values) <= 1000
This query processed 20.6 GB in 11.9 s (half a billion rows). It returned only one row, because every other column had more than 1,000 distinct values (my threshold).
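The LIMIT 1001 plus the ARRAY_LENGTH(values) <= 1000 filter is the threshold trick: each array keeps at most threshold + 1 distinct values, so any column that reaches 1001 is known to exceed the threshold without aggregating everything. Applied to the question's table (hypothetical name mydataset.mytable, threshold 4, so LIMIT 5), the same pattern would look roughly like this:

WITH arrays AS (
  SELECT * FROM UNNEST((
    SELECT [
      STRUCT('col1' AS col, ARRAY_AGG(DISTINCT col1 IGNORE NULLS LIMIT 5) AS values)
      , ('col2', ARRAY_AGG(DISTINCT col2 IGNORE NULLS LIMIT 5))
      , ('col3', ARRAY_AGG(DISTINCT col3 IGNORE NULLS LIMIT 5))
      , ('col4', ARRAY_AGG(DISTINCT col4 IGNORE NULLS LIMIT 5))
      , ('col5', ARRAY_AGG(DISTINCT col5 IGNORE NULLS LIMIT 5))
      , ('col6', ARRAY_AGG(DISTINCT col6 IGNORE NULLS LIMIT 5))
    ]
    FROM `mydataset.mytable`
  ))
)
SELECT *
FROM arrays
WHERE ARRAY_LENGTH(values) <= 4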
That's traditional SQL, but here is an even simpler query that produces similar results:
SELECT col, ARRAY_AGG(DISTINCT value IGNORE NULLS LIMIT 1001) AS values
FROM (
  SELECT REGEXP_EXTRACT(x, r'"([^\"]*)"') AS col, REGEXP_EXTRACT(x, r'":"([^\"]*)"') AS value
  FROM (
    SELECT SPLIT(TO_JSON_STRING(STRUCT(repo.name, actor.login, type, org.login)), ',') AS x
    FROM `githubarchive.year.2017`
  ), UNNEST(x) AS x
)
GROUP BY col
HAVING ARRAY_LENGTH(values) <= 1000
# 17.0 sec elapsed, 20.6 GB processed
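The trick in the inner subquery: TO_JSON_STRING turns each row's STRUCT into a JSON string, SPLIT breaks it into "key":"value" fragments, and the two REGEXP_EXTRACTs pull the column name and the value out of each fragment, so the aggregation never has to name the columns again. A minimal illustration of that intermediate step, using made-up values from the question:

-- Illustration only: what SPLIT(TO_JSON_STRING(...)) produces for one row.
SELECT SPLIT(TO_JSON_STRING(STRUCT('a1' AS col1, 'b1' AS col2)), ',') AS x
-- x = ['{"col1":"a1"', '"col2":"b1"}']; the REGEXP_EXTRACTs then read
-- col1 / a1 and col2 / b1 out of those fragments.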
Caveat: This will only work if there are no special characters in the column values, like quotes or commas. If you have those, it won't be as straightforward (but still possible).
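For example, a comma inside a value throws the SPLIT-by-comma step off (again just an illustration with made-up values):

-- A comma inside a value breaks the naive split:
SELECT SPLIT(TO_JSON_STRING(STRUCT('a,1' AS col1, 'b1' AS col2)), ',') AS x
-- x = ['{"col1":"a', '1"', '"col2":"b1"}'], so the first fragment no longer
-- holds a complete "key":"value" pair and the value can no longer be extracted.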