Search code examples
sqlsortinggoogle-bigquerywindow-functionsdatabase-partitioning

Google BigQuery SQL: Order two columns independently


Say I have some data like:

grp   v1   v2
---   --   --
 2    5    7
 2    4    9
 3    10   2
 3    11   1

I'd like to create new columns which are independent of the ordering of the table - such that the two columns have independent orderings, i.e. sort by v1 independently of v2, while partitioning by grp.

The result (independently ordered, partitioned by grp) would be:

grp   v1   v2  v1_ordered v2_ordered
---   --   --  ---------- ----------
 2    5    7       4          7
 2    4    9       5          9
 3    10   2      10          1
 3    11   1      11          2

One way to do this is to create two tables and CROSS JOIN. However, I'm working with too many rows of data for this to be computationally tractable - is there a way to do this within a single query without a JOIN?

Basically, I'd like to write SQL like:

SELECT
  *,
  v1 OVER (PARTITION BY grp ORDER BY v1 ASC) as v1_ordered,
  v2 OVER (PARTITION BY grp ORDER BY v2 ASC) as v2_ordered
FROM [example_table]

This breaks table row meaning, but it's a necessary feature for many applications - for example computing ordered correlation between two fields CORR(v1_ordered, v2_ordered).

Is this possible?


Solution

  • I think you are in right direction! You just need to use proper window function . Row_number() in this case. And it should work!

    Adding working example as per @cgn request:
    I dont think there is way to totally avoid use of JOIN.
    At the same time below example uses just ONE JOIN vs TWO JOINs in other answers:

    SELECT 
      a.grp AS grp, 
      a.v1 AS v1, 
      a.v2 AS v2, 
      a.v1 AS v1_ordered, 
      b.v2 AS v2_ordered 
    FROM (
      SELECT grp, v1, v2, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY v1) AS v1_order
      FROM [example_table]
    ) AS a
    JOIN (
      SELECT grp, v1, v2, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY v2) AS v2_order
      FROM [example_table]
    ) AS b
    ON a.grp = b.grp AND a.v1_order = b.v2_order 
    

    Result is as expected:

    grp v1  v2  v1_ordered  v2_ordered   
    2    4   9           4           7   
    2    5   7           5           9   
    3   10   2          10           1   
    3   11   1          11           2   
    

    And now you can use CORR() as below

    SELECT grp, CORR(v1_ordered, v2_ordered) AS [corr]
    FROM (
      SELECT 
        a.grp AS grp, 
        a.v1 AS v1, 
        a.v2 AS v2, 
        a.v1 AS v1_ordered, 
        b.v2 AS v2_ordered 
      FROM (
        SELECT grp, v1, v2, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY v1) AS v1_order
        FROM [example_table]
      ) AS a
      JOIN (
        SELECT grp, v1, v2, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY v2) AS v2_order
        FROM [example_table]
      ) AS b
      ON a.grp = b.grp AND a.v1_order = b.v2_order
    )
    GROUP BY grp