Search code examples
sqljoingoogle-bigqueryscalabilitycross-join

BigQuery/SQL: Sum over intervals indicated by a secondary table


Suppose I have two tables: intervals contains index intervals (its columns are i_min and i_max) and values contains indexed values (with columns i and x). Here's an example:

 values:        intervals:
+---+---+   +-------+-------+
| i | x |   | i_min | i_max |
+-------+   +---------------+
| 1 | 1 |   |   1   |   4   |
| 2 | 0 |   |   6   |   6   | 
| 3 | 4 |   |   6   |   6   | 
| 4 | 9 |   |   6   |   6   |
| 6 | 7 |   |   7   |   9   |
| 7 | 2 |   |  12   |  17   |
| 8 | 2 |   +-------+-------+
| 9 | 2 |
+---+---+

I want to sum the values of x for each interval:

       result:
+-------+-------+-----+
| i_min | i_max | sum | 
+---------------------+ 
|   1   |   4   |  13 | // 1+0+4+9
|   6   |   6   |   7 | 
|   6   |   6   |   7 | 
|   6   |   6   |   7 | 
|   7   |   9   |   6 | // 2+2+2
|  12   |  17   |   0 |
+-------+-------+-----+

In some SQL engines, this could be done using:

SELECT
  i_min,
  i_max,
  (SELECT SUM(x)
   FROM values 
   WHERE i BETWEEN intervals.i_min AND intervals.i_max) AS sum_x
FROM
  intervals

except that type of query is not allowed by BigQuery ("Subselect not allowed in SELECT clause." or "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join." depending on the syntax used).

There must be a way to do this with window functions, but I can't figure out how — all examples I've seen have the partition as part of the table. Is there an option that doesn't use CROSS JOIN? If not, what's the most efficient way to do this CROSS JOIN?

Some notes on my data:

  • Both tables contain many (10⁸-10⁹) rows.
  • There might be repetitions in intervals, not in i.
  • But two intervals in intervals are either the same, either entirely disjoint (no overlaps).
  • The union of all intervals is typically close to the set of all values of i (so it forms a partition of this space).
  • Intervals might be large (say, i_max-i_min < 10⁶).

Solution

  • Try below - BigQuery Standard SQL

    #standardSQL
    SELECT
      i_min, i_max, SUM(x) AS  sum_x
    FROM (
      SELECT i_min, i_max, ROW_NUMBER() OVER() AS line FROM `project.dataset.intervals`
    ) AS intervals
    JOIN (SELECT i, x FROM `project.dataset.values` UNION ALL SELECT NULL, 0) AS values
    ON values.i BETWEEN intervals.i_min AND intervals.i_max OR values.i IS NULL 
    GROUP BY i_min, i_max, line
    -- ORDER BY i_min
    

    you can play/test with dummy data as below

    #standardSQL
    WITH intervals AS (
      SELECT  1 AS i_min, 4 AS i_max UNION ALL
      SELECT  6, 6 UNION ALL
      SELECT  6, 6 UNION ALL
      SELECT  6, 6 UNION ALL
      SELECT  7, 9 UNION ALL
      SELECT 12, 17 
    ),
    values AS (
      SELECT 1 AS i, 1 AS x UNION ALL
      SELECT 2, 0 UNION ALL
      SELECT 3, 4 UNION ALL
      SELECT 4, 9 UNION ALL
      SELECT 6, 7 UNION ALL
      SELECT 7, 2 UNION ALL
      SELECT 8, 2 UNION ALL
      SELECT 9, 2 
    )
    SELECT
      i_min, i_max, SUM(x) AS  sum_x
    FROM (SELECT i_min, i_max, ROW_NUMBER() OVER() AS line FROM intervals) AS intervals
    JOIN (SELECT i, x FROM values UNION ALL SELECT NULL, 0) AS values
    ON values.i BETWEEN intervals.i_min AND intervals.i_max OR values.i IS NULL 
    GROUP BY i_min, i_max, line
    -- ORDER BY i_min