Search code examples
apache-spark-sqlgeospatialgeomesa

Left Semi Join on Geo-Spatial tables in Spark-SQL & GeoMesa


Problem:

I have 2 tables (d1 & d2) containing Geo-spatial points. I want to carry out the following query:

select * from table 1 where table1.point is within 50km of any point in table2.point

I am using Spark-SQL with GeoMesa & Accumulo to achieve the same. (Spark as processing engine, Accumulo as Data Store & GeoMesa for GeoSpatial libraries).

The above query is kind of left semi join but I am not sure on how to achieve it using Spark-SQL because as far as I have read subqueries can't be used in where clause.


Solution

  • Was able to achieve this using:

    select * from d1 left semi join d2 on st_contains(st_bufferPoint(d1.point, 10000.0), d2.point)
    

    Spark broadcasted d2 & is carrying out joins but it is still taking more time as the size of d1 is 5 billion & d2 is 10 million.

    Not sure though if there is any more efficient way to achieve the same.