Search code examples
azure-data-lakeu-sql

Count distinct over multiple columns in USQL


I'd like to count distinct over two columns.

In imaginary sql, I'd write the below:

COUNT(DISTINCT col1, col2)

In T-SQL, this is done by having one grouping in a sub query and then a second grouping in the outer query - like this

As far as I know, USQL doesn't have sub queries, so I'm a bit stuck trying to work out how to do this. I suppose I could use two aliased queries, but I was wondering if there was a better way.


Solution

  • U-SQL does support subqueries. What it does not (yet) support are scalar subqueries and correlated subqueries (use a join for both instead).

    So the solution

    SELECT COUNT(*) AS count
    FROM (SELECT DISTINCT DocumentId, DocumentSessionId
          FROM DocumentOutputItems) AS internalQuery
    

    or the mentioned SUM(CASE) with GROUP BY in the link above both work.

    Alternatively, U-SQL also allows you to name the inner query and use it in the outer statement which makes the script easier to read for more complex cases (it is similar to a common-table expression in that it does only name the query and does not execute it):

    @inner = 
      SELECT DISTINCT DocumentId, DocumentSessionId
      FROM DocumentOutputItems;
    @result = SELECT COUNT(*) AS count FROM @inner;