Search code examples
sqlgoogle-bigquerydistanceprobability-distribution

One-dimensional earth mover's distance in BigQuery/SQL


Let P and Q be two finite probability distributions on integers, with support between 0 and some large integer N. The one-dimensional earth mover's distance between P and Q is the minimum cost you have to pay to transform P into Q, considering that it costs r*|n-m| to "move" a probability r associated to integer n to another integer m.

There is a simple algorithm to compute this. In pseudocode:

previous = 0
sum = 0
for i from 0 to N:
    previous = P(i) - Q(i) + previous
    sum = sum + abs(previous)         // abs = absolute value
return sum

Now, suppose you have two tables that contain each a probability distribution. Column n contains integers, and column p contains the corresponding probability. The tables are correct (all probabilities are between 0 and 1, their sum is I want to compute the earth mover's distance between these two tables in BigQuery (Standard SQL).

  1. Is it possible? I feel like one would need to use analytical functions, but I don't have much experience with them, so I don't know how to get there.
  2. What if N (the maximum integers) is very large, but my tables are not? Can we adapt the solution to avoid doing a computation for each integer i?

Solution

  • I adapted Michael's answer to fix its issues, here's the solution I ended up with. Suppose the integers are stored in column i and the probability in column p. First I join the two tables, then I compute EMD(i) for all i using the window, then I sum all absolute values.

    WITH
    joined_table AS (
      SELECT
        IFNULL(table1.i, table2.i) AS i,
        IFNULL(table1.p, 0) AS p,
        IFNULL(table2.p, 0) AS q,
      FROM table1
      OUTER JOIN table2
      ON table1.i = table2.i
    ),
    aggr AS (
      SELECT
        (SUM(p-q) OVER win) * (i - (LAG(i,1) OVER win)) AS emd
      FROM joined_table
      WINDOW win AS (
        ORDER BY i
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
      )
    )
    SELECT SUM(ABS(emd)) AS total_emd
    FROM aggr