Search code examples
sqlgoogle-bigqueryaggregationdata-analysiscross-join

Row Aggregation after Cross Join in BigQuery


Say you have the following table in BigQuery:

A = user1 | 0 0 |
    user2 | 0 3 |
    user3 | 4 0 |

After a cross join, you have

dist = |user1  user2  0 0 , 0 3 |  #comma is just showing user val seperation
       |user1  user3  0 0 , 4 0 |
       |user2  user3  0 3 , 4 0 |

How can you perform row aggregation in BigQuery to compute pairwise aggregation across rows. As a typical use case, you could compute the euclidean distance between the two users. I want to compute the following metric between the two users:

sum(min(user1_row[i], user2_row[i]) / abs(user1_row[i] - user2_row[i]))

summed over all i for each pair of users.

For example in Python you would simply:

for i in np.arange(row_length/2)]):
dist.append([user1, user2, np.sum(min(r1[i], r2[i]) / abs(r1[i] - r2[i]))])

Solution

  • To start with the ugly way: you could flatten out the math into the query. That is, turn for i in ... sum(min(...)/abs(...)) into SQL operating over each of the fields. Note that MIN and SUM are aggregate functions that you won't want to use. Instead use + for SUM and IF(a < b, a, b) for MIN. ABS(a, b) looks like IF(a < b, b-a, a-b). If you were just computing the Euclidian distance, you could do

    SELECT left.user, right.user, 
      SQRT((left.x-right.x)*(left.x-right.x) 
         + (left.y-right.y)*(left.y-right.y) 
         + (left.z-right.z)*(left.z-right.z)) as dist 
    FROM (
        SELECT * 
        FROM dataset.table1 AS left 
        CROSS JOIN dataset.table1 AS right)
    

    The nicer way is User-Defined Functions, and create the vectors as repeated values. You can then write a DISTANCE() function that performs your computation over the two arrays from the left and the right side of the cross join. If you're not in the UDF beta program and would like to join, please contact google cloud support.

    Finally, if you change your schema from {user:string, field1:float, field2:float, field3:float,...} to {user:string, fields:[field:float]}

    You could then flatten the field with position and do the cross join on that. As in:

    SELECT
      user,
      field,
      index,
    FROM (FLATTEN((
      SELECT  
        user,
        fields.field as field,
        POSITION(fields.field) as index,
      from [dataset1.table1]
    ), fields))
    

    If you save this as a view, call it "dataset1.flat_view"

    Then you can do your join:

    SELECT left.user as user1, right.user as user2,
           left.field as l, right.field as r,
    FROM dataset1.flat_view left
    JOIN dataset1.flat_view right 
    ON left.index = right.index
    WHERE left.user != right.user
    

    This will give you one row each for each pair of users and each field matching field. You can save that as the view "dataset1.joined_view".

    Finally, you can do your aggregations:

    Since you want this:

    sum(min(user1_row[i], user2_row[i]) / abs(user1_row[i] - user2_row[i]))
    

    it would look like:

    SELECT user1, user2, 
        SUM((if (l < r, l, r)) / (if (l > r, l-r, r-l))
    FROM [dataset1.joined_view] 
    GROUP EACH BY user1, user2