Let's say I have the following table:
WITH tbl AS (
SELECT 1 AS id, "Phone" AS product, 105 AS cost UNION ALL
SELECT 2 AS id, "Camera" AS product, 82 AS cost UNION ALL
SELECT 3 AS id, "Cup" AS product, 103 AS cost
) SELECT * FROM tbl
How would I get N distinct values for each column? For example, similar to the "Value distribution" (bottom right) from Power BI shown here:
I don't care about the counts of each value; I just want to pull a representative sample of values (say, up to ten).
Is there a way to display a sample of values without having to run a separate query for each column? That is, I'd like to grab them all in one go. So far I have something like:
WITH tbl AS (
SELECT 1 AS id, 'Phone' AS product, 105 AS cost UNION ALL
SELECT 2 AS id, 'Camera' AS product, 82 AS cost UNION ALL
SELECT 3 AS id, 'Cup' AS product, 103 AS cost
)
SELECT
  ARRAY_AGG(DISTINCT id LIMIT 2),
  ARRAY_AGG(DISTINCT product LIMIT 2),
  ARRAY_AGG(DISTINCT cost LIMIT 2)
FROM tbl
This works, but it seems very inefficient (I believe it's the same as running a query for each column). What is a better way to do this?
Or, to generalize what I think is a poor approach, but one that's applicable outside BigQuery:
WITH tbl AS (
SELECT 1 AS id, 'Phone' AS product, 105 AS cost UNION ALL
SELECT 2 AS id, 'Camera' AS product, 82 AS cost UNION ALL
SELECT 3 AS id, 'Cup' AS product, 103 AS cost
)
select 'id' as field, array(select distinct cast(id as string) from tbl limit 2) as values union all
select 'product', array(select distinct cast(product as string) from tbl limit 2) union all
select 'cost', array(select distinct cast(cost as string) from tbl limit 2);
And a further improvement based on Mikhail's answer:
WITH tbl AS (
SELECT 1 AS id, "Phone" AS product, 105 AS cost, true as is_big, date '2014-01-01' as d UNION ALL
SELECT 2 AS id, "Camera" AS product, 82 AS cost, false as is_big, date '2017-01-01' as d UNION ALL
SELECT 3 AS id, "Cup" AS product, 103 AS cost, false as is_big, date '2015-01-01' as d union all
SELECT 7 AS id, "Several" AS product, 103 AS cost, true as is_big, date '2016-01-01' as d
)
SELECT
  name,
  IF(
    array_length(quantiles) is not null,
    ARRAY(SELECT CAST(tmp AS STRING) FROM UNNEST(quantiles) tmp),
    ARRAY(SELECT value FROM t.top_values)
  ) values
FROM ML.DESCRIBE_DATA(
  (SELECT * FROM tbl), STRUCT(3 AS num_quantiles, 4 AS top_k)
) t;
A skip scan with a recursive CTE would do exactly what you want, exactly how you want it: run just once and iteratively grab values that haven't already been collected, until it collects your desired number of distinct samples for each column. PostgreSQL has recursive CTEs, and BigQuery has them too.
On 300k random rows with 3 columns holding 1k, 190 and 10k unique values respectively, it takes 0.12ms to get 3 samples of each, 1.2ms to get 30, and 40.0ms to get 120, with no index: demo at db<>fiddle
prepare recursive_cte_3way_skip_scan(int) as
with recursive cte as (
  ( select 1 as i
         , array[id] ids
         , array[product] products
         , array[cost] costs
    from tbl limit 1) --get 1 random, most convenient value
  union all
  select cte.i+1 as i
       , (select array[id] ids
          from tbl
          where array_position(cte.ids, tbl.id) is null
          limit 1) || cte.ids
       , (select array[product] products
          from tbl
          where array_position(cte.products, tbl.product) is null
          limit 1) || cte.products
       , (select array[cost] costs
          from tbl
          where array_position(cte.costs, tbl.cost) is null
          limit 1) || cte.costs
  from cte
  where $1 > i
)
select ids, products, costs
from cte
order by i desc
limit 1;
I'm using array_position() instead of <>all() to be able to catch null values. The latter uses regular equality =, so it will never look for or match against them; the former uses the is not distinct from construct instead, so it can handle them fine (a minimal illustration follows the sample output below).
ids | products | costs |
---|---|---|
{94,937,743} | {"Mouse Pad",Floor,Fork} | {213.9,64.3,362.5} |
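To see that null-handling difference in isolation, here's a quick standalone check; the literals are made up for illustration and aren't tied to the demo data:
select null::int <> all(array[1, 2])                  as neq_all_missing, --null, i.e. not true: the row would be skipped
       array_position(array[1, 2], null) is null      as pos_missing,     --true: null correctly reported as not yet collected
       array_position(array[1, null], null) is null   as pos_collected;   --false: the null already in the array is found at position 2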
That said, there are scenarios where your assumption doesn't hold:
"This works, but it seems very inefficient (I believe it's the same as running a query for each column)."
Having "a query for each column" could actually be desired if you have indexes on them. PostgreSQL will split it up into 3 independent InitPlans against the 3 indexes, popping your desired number of examples from each of them.
In the same test it takes 0.9ms, and since this is just taking N steps up a BTree, you can expect it to scale linearly relative to the target sample size of N, regardless of how big your data set is:
create index idx1 on tbl (id);
create index idx2 on tbl (product);
create index idx3 on tbl (cost);
vacuum analyze tbl;
prepare subqueries_three as
select (select array_agg(id)      from (select distinct id      from tbl limit 3) _)
     , (select array_agg(product) from (select distinct product from tbl limit 3) _)
     , (select array_agg(cost)    from (select distinct cost    from tbl limit 3) _);
explain analyze verbose execute subqueries_three ;
execute subqueries_three;
QUERY PLAN |
---|
Result (cost=126.35..126.36 rows=1 width=96) (actual time=0.829..0.831 rows=1 loops=1) |
Output: $0, $1, $2 |
InitPlan 1 (returns $0) |
-> Index Only Scan using idx1 on public.tbl (cost=0.42..5724.42 rows=300000 width=4) (actual time=0.041..0.081 rows=430 loops=1) |
Output: tbl.id |
InitPlan 2 (returns $1) |
-> Index Only Scan using idx2 on public.tbl tbl_1 (cost=0.42..5772.42 rows=300000 width=8) (actual time=0.034..0.365 rows=3155 loops=1) |
Output: tbl_1.product |
InitPlan 3 (returns $2) |
-> Index Only Scan using idx3 on public.tbl tbl_2 (cost=0.42..8000.42 rows=300000 width=6) (actual time=0.037..0.043 rows=54 loops=1) |
Output: tbl_2.cost |
Planning Time: 1.046 ms |
Execution Time: 0.900 ms |
array_agg | array_agg | array_agg |
---|---|---|
{0,1,2} | {"Air Freshener",Apple,Bag} | {0.0,0.1,0.2} |
Still, if you do have the indexes set up, you can remix the earlier skip scan into an index skip scan. It's the 3rd one down in the demo, the quickest one of them all, grabbing the 120 samples in 5ms.
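For a single column, a rough sketch of that idea looks like the following loose index scan, assuming the idx1 index created above; every step is one probe of the index for the smallest id greater than the last one collected, and the demo's 3-column version follows the same pattern, carrying one array per column like the recursive CTE earlier:
with recursive skip as (
  select min(id) as id, 1 as i from tbl                          --seed: smallest id, straight off idx1
  union all
  select (select min(t.id) from tbl t where t.id > s.id), s.i + 1
  from skip s
  where s.id is not null and s.i < 3                             --stop after 3 samples, or earlier if ids run out
)
select array_agg(id) as ids
from skip
where id is not null;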
Probing a data set with no indexes in place is exactly what tablesample, already mentioned by @Erwin Brandstetter, was made for. The fact that distinct only has to go through a small portion of the table can save most of the work.
The exec time is driven by that sample size: if you lower it too much, you increase the risk that the slice you get doesn't happen to contain your desired number of examples:
SELECT
  (ARRAY_AGG(DISTINCT id))[:3],
  (ARRAY_AGG(DISTINCT product))[:3],
  (ARRAY_AGG(DISTINCT cost))[:3]
FROM tbl tablesample system(.1);
QUERY PLAN |
---|
Aggregate (cost=26.35..26.36 rows=1 width=96) (actual time=0.467..0.468 rows=1 loops=1) |
Output: (array_agg(DISTINCT id))[:3], (array_agg(DISTINCT product))[:3], (array_agg(DISTINCT cost))[:3] |
-> Sort (cost=23.34..24.09 rows=300 width=18) (actual time=0.119..0.138 rows=321 loops=1) |
Output: id, product, cost |
Sort Key: tbl.id |
Sort Method: quicksort Memory: 39kB |
-> Sample Scan on public.tbl (cost=0.00..11.00 rows=300 width=18) (actual time=0.013..0.061 rows=321 loops=1) |
Output: id, product, cost |
Sampling: system ('0.1'::real) |
Planning Time: 0.092 ms |
Execution Time: 0.494 ms |
array_agg | array_agg | array_agg |
---|---|---|
{10,15,33} | {"Air Freshener",Apple,Bag} | {0.03,1.27,1.42} |
With indexes in place, the index skip scan still performs comparably, without the risk of hitting a non-representative slice or the need to tweak the sampling parameter.
Without indexes, as long as your values are evenly distributed (or you're lucky), tablesample is the simplest way to speed things up.