Tags: sql, postgresql, google-bigquery

How to get the First N values of multiple columns?


Let's say I have the following table:

WITH tbl AS (
    SELECT 1 AS id, "Phone" AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, "Camera" AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, "Cup" AS product, 103 AS cost
) SELECT * FROM tbl

How would I get N distinct values for each column? For example, similar to the "Value distribution" (bottom right) from Power BI shown here:

[screenshot: Power BI "Value distribution" panel]

I don't care about the counts of each value; I just want to pull a representative sample of values (say, up to ten).

For example, how could I display a sample of values without having to run a query for each column? That is, I'd like to grab them all in one go. So far I have something like:

WITH tbl AS (
    SELECT 1 AS id, 'Phone' AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, 'Camera' AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, 'Cup' AS product, 103 AS cost
) 
SELECT 
    ARRAY_AGG(DISTINCT id LIMIT 2),
    ARRAY_AGG(DISTINCT product LIMIT 2),
    ARRAY_AGG(DISTINCT cost LIMIT 2)
FROM tbl

This works, but it seems very inefficient (I believe the same as running a query for each column). What is a better way to do this?

Or, generalizing what I think is a poor approach, but one that works outside BigQuery too:

WITH tbl AS (
    SELECT 1 AS id, 'Phone' AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, 'Camera' AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, 'Cup' AS product, 103 AS cost
)  
SELECT 'id' AS field, ARRAY(SELECT DISTINCT CAST(id AS STRING) FROM tbl LIMIT 2) AS values UNION ALL
SELECT 'product', ARRAY(SELECT DISTINCT CAST(product AS STRING) FROM tbl LIMIT 2) UNION ALL
SELECT 'cost', ARRAY(SELECT DISTINCT CAST(cost AS STRING) FROM tbl LIMIT 2);

And a further improvement based on Mikhail's answer:

WITH tbl AS (
    SELECT 1 AS id, "Phone" AS product, 105 AS cost, TRUE AS is_big, DATE '2014-01-01' AS d UNION ALL
    SELECT 2 AS id, "Camera" AS product, 82 AS cost, FALSE AS is_big, DATE '2017-01-01' AS d UNION ALL
    SELECT 3 AS id, "Cup" AS product, 103 AS cost, FALSE AS is_big, DATE '2015-01-01' AS d UNION ALL
    SELECT 7 AS id, "Several" AS product, 103 AS cost, TRUE AS is_big, DATE '2016-01-01' AS d
) 
SELECT 
  name, 
  IF(
    array_length(quantiles) is not null, 
    ARRAY(SELECT CAST(tmp AS STRING) FROM UNNEST(quantiles) tmp), 
    ARRAY(SELECT value FROM t.top_values)
  ) values
FROM ML.DESCRIBE_DATA(
  (SELECT * FROM tbl), STRUCT(3 AS num_quantiles, 4 AS top_k)
) t;

Solution

  • A skip scan with a recursive CTE would do exactly what you want, exactly how you want it: it runs just once and iteratively grabs values that haven't been collected yet, until it has your desired number of distinct samples for each column. Recursive CTEs are available in both PostgreSQL and BigQuery.

    On 300k random rows with 3 columns holding 1k, 190 and 10k unique values respectively, it takes 0.12 ms to get 3 samples of each, 1.2 ms to get 30, and 40.0 ms to get 120, all with no index: demo at db<>fiddle

    prepare recursive_cte_3way_skip_scan(int) as
    with recursive cte as (
      ( select 1 as i
             , array[id]      as ids
             , array[product] as products
             , array[cost]    as costs
        from tbl limit 1 )  -- seed: 1 arbitrary, most convenient row
      union all
      select cte.i + 1 as i
           , (select array[id]                 -- next id not collected yet
              from tbl
              where array_position(cte.ids, tbl.id) is null
              limit 1) || cte.ids
           , (select array[product]            -- next product not collected yet
              from tbl
              where array_position(cte.products, tbl.product) is null
              limit 1) || cte.products
           , (select array[cost]               -- next cost not collected yet
              from tbl
              where array_position(cte.costs, tbl.cost) is null
              limit 1) || cte.costs
      from cte
      where $1 > i                             -- keep going until $1 samples per column
    )
    select ids, products, costs
    from cte order by i desc limit 1;
    

    I'm using array_position() instead of <> ALL() to be able to catch NULL values. The latter uses regular equality (=), so it will never look for or match against NULLs. The former uses the IS NOT DISTINCT FROM construct instead, so it handles them fine (a quick standalone check follows the results below).

    ids          | products                 | costs
    {94,937,743} | {"Mouse Pad",Floor,Fork} | {213.9,64.3,362.5}
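
    To illustrate the NULL-handling difference mentioned above, here's a standalone check, separate from the query itself:

    -- <> ALL compares with plain equality, so a NULL element makes the test unknown rather than true:
    select 3 <> all(array[1, 2, null]);               -- returns NULL, which a WHERE clause treats as false
    -- array_position() compares with IS NOT DISTINCT FROM, so NULLs neither break the search...
    select array_position(array[1, 2, null], 3);      -- returns NULL ("not found"), caught by IS NULL
    -- ...nor are they invisible as values:
    select array_position(array[1, 2, null], null);   -- returns 3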

    On the other hand, your assumption here isn't necessarily correct:

    "This works, but it seems very inefficient (I believe the same as running a query for each column)."

    Having "a query for each column" could actually be desired if you have indexes on them. PostgreSQL will split it up into 3 independent InitPlans against the 3 indexes, popping your desired number of examples from each of them.

    In the same test it takes 0.9 ms, and since this is just taking N steps up a B-tree, you can expect it to scale linearly with the target sample size N, regardless of how big your data set is:

    create index idx1 on tbl (id);
    create index idx2 on tbl (product);
    create index idx3 on tbl (cost);
    vacuum analyze tbl;
    
    prepare subqueries_three as
    select (select array_agg(id) from (select distinct id from tbl limit 3)_)
          ,(select array_agg(product) from (select distinct product from tbl limit 3)_)
          ,(select array_agg(cost) from (select distinct cost from tbl limit 3)_);
    
    explain analyze verbose execute subqueries_three ;
    execute subqueries_three;
    
    QUERY PLAN
    Result (cost=126.35..126.36 rows=1 width=96) (actual time=0.829..0.831 rows=1 loops=1)
      Output: $0, $1, $2
      InitPlan 1 (returns $0)
                          -> Index Only Scan using idx1 on public.tbl (cost=0.42..5724.42 rows=300000 width=4) (actual time=0.041..0.081 rows=430 loops=1)
                                Output: tbl.id
      InitPlan 2 (returns $1)
                          -> Index Only Scan using idx2 on public.tbl tbl_1 (cost=0.42..5772.42 rows=300000 width=8) (actual time=0.034..0.365 rows=3155 loops=1)
                                Output: tbl_1.product
      InitPlan 3 (returns $2)
                          -> Index Only Scan using idx3 on public.tbl tbl_2 (cost=0.42..8000.42 rows=300000 width=6) (actual time=0.037..0.043 rows=54 loops=1)
                                Output: tbl_2.cost
    Planning Time: 1.046 ms
    Execution Time: 0.900 ms

    array_agg | array_agg                   | array_agg
    {0,1,2}   | {"Air Freshener",Apple,Bag} | {0.0,0.1,0.2}

    Still, if you do have the indexes set up, you can rework the earlier skip scan into an index skip scan. It's the third query in the demo and the quickest of them all, grabbing the 120 samples in 5 ms.
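
    For a rough idea of what that looks like, here's a minimal single-column sketch of the loose/index skip scan pattern (the demo's version interleaves all three columns; this one only walks id via the idx1 index created above, and the cte name skip is just for illustration):

    with recursive skip as (
      (select id from tbl order by id limit 1)           -- smallest id: one index probe
      union all
      select (select t.id
              from tbl t
              where t.id > skip.id
              order by t.id
              limit 1)                                    -- next distinct id: another probe
      from skip
      where skip.id is not null                           -- stop once the column is exhausted
    )
    select array_agg(id) as ids
    from (select id from skip where id is not null limit 3) _;  -- stop after 3 samples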

    Probing a data set with no indexes in place is exactly what TABLESAMPLE, already mentioned by @Erwin Brandstetter, was made for. Since DISTINCT only has to go through a small portion of the table, it saves most of the work.
    Execution time is driven by that sample size: if you lower it too much, you increase the risk that the slice you get doesn't contain your desired number of examples:

    SELECT 
        (ARRAY_AGG(DISTINCT id))[:3],
        (ARRAY_AGG(DISTINCT product))[:3],
        (ARRAY_AGG(DISTINCT cost))[:3]
    FROM tbl tablesample system(.1) ;
    
    QUERY PLAN
    Aggregate (cost=26.35..26.36 rows=1 width=96) (actual time=0.467..0.468 rows=1 loops=1)
      Output: (array_agg(DISTINCT id))[:3], (array_agg(DISTINCT product))[:3], (array_agg(DISTINCT cost))[:3]
      -> Sort (cost=23.34..24.09 rows=300 width=18) (actual time=0.119..0.138 rows=321 loops=1)
            Output: id, product, cost
            Sort Key: tbl.id
            Sort Method: quicksort Memory: 39kB
            -> Sample Scan on public.tbl (cost=0.00..11.00 rows=300 width=18) (actual time=0.013..0.061 rows=321 loops=1)
                  Output: id, product, cost
                  Sampling: system ('0.1'::real)
    Planning Time: 0.092 ms
    Execution Time: 0.494 ms

    array_agg  | array_agg                   | array_agg
    {10,15,33} | {"Air Freshener",Apple,Bag} | {0.03,1.27,1.42}

    With indexes in place, the index skip scan still performs comparably, without the risk of hitting a non-representative slice or the need to tweak the sampling percentage.
    Without indexes, as long as your values are evenly distributed (or you're lucky), TABLESAMPLE is the simplest way to speed things up.