I have a dataset table like this in Google Big Query:
| col1 | col2 | col3 | col4 | col5 | col6 |
-------------------------------------------
| a1 | b1 | c1 | d1 | e2 | f1 |
| a2 | b2 | c2 | d1 | e2 | f2 |
| a1 | b3 | c3 | d1 | e3 | f2 |
| a2 | b1 | c4 | d1 | e4 | f2 |
| a1 | b2 | c5 | d1 | e5 | f2 |
Let's say the given threshold number is 4, in that case, I want to transform this into one of the tables given below:
| col1 | col2 | col4 | col5 | col6 |
---------------------------------------------------------------------
| [a1,a2] | [b1,b2,b3] | [d1] | [e2,e3,e4,e5] | [f1,f2] |
Or like this:
| col | values |
------------------------
| col1 | [a1,a2] |
| col2 | [b1,b2,b3] |
| col4 | [d1] |
| col5 | [e2,e3,e4,e5] |
| col6 | [f1,f2] |
Please note that col3 was removed because it contained more than 4 (the threshold) distinct values. I explored a lot of documents here but was not able to figure out the required query. Can somebody help or point me in the right direction?
Edit: I have one solution in mind, where I do something like this:
select * from (
  -- mytable is a placeholder for the actual source table
  select 'col1' as col, array_agg(distinct col1) as values from mytable union all
  select 'col2', array_agg(distinct col2) from mytable union all
  select 'col3', array_agg(distinct col3) from mytable union all
  select 'col4', array_agg(distinct col4) from mytable union all
  select 'col5', array_agg(distinct col5) from mytable union all
  select 'col6', array_agg(distinct col6) from mytable
) X where array_length(values) <= 4;
This will give me the second result, but it requires complex query construction, assuming I don't know the number and names of the columns up front. Also, this might cross BigQuery's 100 MB per-row limit, as I will have more than a billion rows in the table. Please also suggest if there is a better way to do this.
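For reference, a sketch of how that per-column UNION ALL could be generated dynamically with BigQuery scripting, so the column names don't need to be known up front. The dataset and table names (mydataset.mytable) are placeholders, the threshold is hard-coded to 4, and it assumes the columns are scalar and castable to STRING:

-- Sketch only: build one SELECT per column from INFORMATION_SCHEMA, then run it.
-- mydataset.mytable is a placeholder; LIMIT 5 = threshold + 1, so a column that
-- reaches 5 distinct values is known to exceed the threshold.
DECLARE per_column_query STRING;

SET per_column_query = (
  SELECT STRING_AGG(
    FORMAT(
      "SELECT '%s' AS col, ARRAY_AGG(DISTINCT CAST(%s AS STRING) IGNORE NULLS LIMIT 5) AS vals FROM mydataset.mytable",
      column_name, column_name),
    ' UNION ALL ')
  FROM mydataset.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name = 'mytable'
);

EXECUTE IMMEDIATE FORMAT(
  "SELECT col, vals FROM (%s) WHERE ARRAY_LENGTH(vals) <= 4",
  per_column_query);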
How about:
WITH arrays AS (
  SELECT * FROM UNNEST((
    SELECT [
      STRUCT("col_repo_name" AS col, ARRAY_AGG(DISTINCT repo.name IGNORE NULLS LIMIT 1001) AS values)
      , ('col_actor_login', ARRAY_AGG(DISTINCT actor.login IGNORE NULLS LIMIT 1001))
      , ('col_type', ARRAY_AGG(DISTINCT type IGNORE NULLS LIMIT 1001))
      , ('col_org_login', ARRAY_AGG(DISTINCT org.login IGNORE NULLS LIMIT 1001))
    ]
    FROM `githubarchive.year.2017`
  ))
)
SELECT *
FROM arrays
WHERE ARRAY_LENGTH(values) <= 1000
This query processed 20.6 GB in 11.9 s (half a billion rows). It returned only one row, because every other column had more than 1,000 distinct values (my threshold).
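The LIMIT 1001 plus the ARRAY_LENGTH(values) <= 1000 filter is the threshold trick: each array keeps at most threshold + 1 distinct values, so any column that reaches 1001 is known to exceed the threshold without aggregating everything. Applied to the question's table (hypothetical name mydataset.mytable, threshold 4, so LIMIT 5), the same pattern would look roughly like this:

WITH arrays AS (
  SELECT * FROM UNNEST((
    SELECT [
      STRUCT('col1' AS col, ARRAY_AGG(DISTINCT col1 IGNORE NULLS LIMIT 5) AS values)
      , ('col2', ARRAY_AGG(DISTINCT col2 IGNORE NULLS LIMIT 5))
      , ('col3', ARRAY_AGG(DISTINCT col3 IGNORE NULLS LIMIT 5))
      , ('col4', ARRAY_AGG(DISTINCT col4 IGNORE NULLS LIMIT 5))
      , ('col5', ARRAY_AGG(DISTINCT col5 IGNORE NULLS LIMIT 5))
      , ('col6', ARRAY_AGG(DISTINCT col6 IGNORE NULLS LIMIT 5))
    ]
    FROM `mydataset.mytable`
  ))
)
SELECT *
FROM arrays
WHERE ARRAY_LENGTH(values) <= 4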
That's traditional SQL, but here is an even simpler query that produces similar results:
SELECT col, ARRAY_AGG(DISTINCT value IGNORE NULLS LIMIT 1001) AS values
FROM (
  SELECT REGEXP_EXTRACT(x, r'"([^\"]*)"') AS col, REGEXP_EXTRACT(x, r'":"([^\"]*)"') AS value
  FROM (
    SELECT SPLIT(TO_JSON_STRING(STRUCT(repo.name, actor.login, type, org.login)), ',') AS x
    FROM `githubarchive.year.2017`
  ), UNNEST(x) AS x
)
GROUP BY col
HAVING ARRAY_LENGTH(values) <= 1000
# 17.0 sec elapsed, 20.6 GB processed
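The trick in the inner subquery: TO_JSON_STRING turns each row's STRUCT into a JSON string, SPLIT breaks it into "key":"value" fragments, and the two REGEXP_EXTRACTs pull the column name and the value out of each fragment, so the aggregation never has to name the columns again. A minimal illustration of that intermediate step, using made-up values from the question:

-- Illustration only: what SPLIT(TO_JSON_STRING(...)) produces for one row.
SELECT SPLIT(TO_JSON_STRING(STRUCT('a1' AS col1, 'b1' AS col2)), ',') AS x
-- x = ['{"col1":"a1"', '"col2":"b1"}']; the REGEXP_EXTRACTs then read
-- col1 / a1 and col2 / b1 out of those fragments.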
Caveat: This will only work if there are no special characters in the column values, like quotes or commas. If you have those, it won't be as straightforward (but still possible).
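For example, a comma inside a value throws the SPLIT-by-comma step off (again just an illustration with made-up values):

-- A comma inside a value breaks the naive split:
SELECT SPLIT(TO_JSON_STRING(STRUCT('a,1' AS col1, 'b1' AS col2)), ',') AS x
-- x = ['{"col1":"a', '1"', '"col2":"b1"}'], so the first fragment no longer
-- holds a complete "key":"value" pair and the value can no longer be extracted.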