Search code examples
sqlimpala

Impala query - optimize a query to get the uniques for given key


I'm looking for ways to count unique users that have a specific pkey and also the count of unique users who didn't have that pkey.

Here is a sample table:

 userid |   pkey     | pvalue
------------------------------
   U1   |   x        | vx
   U1   |   y        | vy
   U1   |   z        | vz
   U2   |   y        | vy
   U3   |   z        | vz
   U4   |   null     | null

I get the expected results to get the unique users who has the pkey='y' and those who didn't using this query but turns out to be expensive:

WITH all_rows AS
  ( SELECT userid, 
           IF( pkey='y', pval, 'none' ) AS val,
           SUM( IF(pkey='y',1,0) ) AS has_key
   FROM some_table 
   GROUP BY userid, val)
SELECT val,
       count(distinct(userid)) uniqs
FROM all_rows
WHERE has_key=1
GROUP BY val
UNION ALL
SELECT 'no_key_set' val,
       count(distinct(userid)) uniqs
FROM all_rows a1 LEFT ANTI JOIN 
     all_rows a2 on (a1.userid = a2.userid and a2.has_key=1) 
GROUP BY val;

Results:

val        | uniqs
--------------------
vy         | 2
no_key_set | 2

I'm looking to avoid using any temp tables, so any better ways this can be achieved?

Thanks!


Solution

  • By using EXPLAIN, you can observe that most of the cost is spent on doing excessive GROUP BY aggregations rather than on using subqueries in your original query.

    Here is a straightforward implementation

    WITH t1 AS (
        SELECT pkey, COUNT(*) AS cnt
        FROM table
        WHERE pkey IS NOT NULL
        GROUP BY pkey
    ), t2 AS (
        SELECT COUNT(DISTINCT userid) AS total_cnt
        FROM table
    )
    SELECT
        CONCAT('no_', pkey) AS pkey,
        (total_cnt - cnt) AS cnt
    FROM t1, t2
    UNION ALL
    SELECT * FROM t1
    
    • t1 gets a table of unique user count per pkey

      +------+-----+
      | pkey | cnt |
      +------+-----+
      | x    | 1   |
      | z    | 2   |
      | y    | 2   |
      +------+-----+
      
    • t2 gets the number of total unique users

      +-----------+
      | total_cnt |
      +-----------+
      | 4         |
      +-----------+
      
    • we can use the result from t2 to get the complement table of t1

      +------+-----+
      | pkey | cnt |
      +------+-----+
      | no_x | 3   |
      | no_z | 2   |
      | no_y | 2   |
      +------+-----+
      
    • a final union of the two tables gives a result of

      +------+-----+
      | pkey | cnt |
      +------+-----+
      | no_x | 3   |
      | no_z | 2   |
      | no_y | 2   |
      | x    | 1   |
      | z    | 2   |
      | y    | 2   |
      +------+-----+