I'm currently working in BigQuery. I have two tables:
Table A {
id int
b_id int
date timestamp
}
Table B {
id int
long float
lat float
date timestamp
}
For one b_id
in A
there are multiple entries in B
. I want to efficiently join both tables and only use the entry in B
where A.b_id = B.id
and the difference between the timestamps is minimal.
My current solution is:
SELECT A.*, B.*, ABS(DATETIME_DIFF(DATETIME(A.date), DATETIME(B.date), SECOND)) as timeDiff
FROM A
INNER JOIN B
ON A.b_id = B.id
WHERE ABS(DATETIME_DIFF(DATETIME(A.date), DATETIME(B.date), SECOND)) =
(
SELECT MIN(ABS(DATETIME_DIFF(DATETIME(B2.date), DATETIME(A2.date), SECOND)))
FROM B as B2
INNER JOIN A as A2
ON B2.driverId = A2.driverId
GROUP BY A2.driverId, B2.rideId
)
Problem is that it takes way to much time for a single entry. Any suggestions on how to improve the query?
You could use window functions:
select *
from (
select a.*, b.*,
row_number() over(
partition by a.b_id
order by abs(datetime_diff(datetime(a.date), datetime(b.date), second))
) rn
from a
inner join b on a.b_id = b.id
) t
where rn = 1