Problem:
I have 2 tables (d1 & d2) containing Geo-spatial points. I want to carry out the following query:
select * from table 1 where table1.point is within 50km of any point in table2.point
I am using Spark-SQL
with GeoMesa
& Accumulo
to achieve the same. (Spark as processing engine, Accumulo as Data Store & GeoMesa for GeoSpatial libraries).
The above query is kind of left semi join
but I am not sure on how to achieve it using Spark-SQL
because as far as I have read subqueries can't be used in where clause.
Was able to achieve this using:
select * from d1 left semi join d2 on st_contains(st_bufferPoint(d1.point, 10000.0), d2.point)
Spark broadcasted d2 & is carrying out joins but it is still taking more time as the size of d1 is 5 billion & d2 is 10 million.
Not sure though if there is any more efficient way to achieve the same.