Tags: apache-spark-sql, geospatial, databricks, azure-databricks

Azure Databricks: Geospatial queries with Spark SQL


Currently I have the following:

  • A Databricks table with devices, positions, and timestamps;
  • A Web API that receives requests with minLat, minLon, maxLat, maxLon and builds a SQL query with lat between minLat and maxLat and lon between minLon and maxLon (a sketch follows this list);
  • A function that receives the query generated by the Web API and opens a JDBC connection to the Databricks cluster to execute it;
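
For context, a rough sketch of the query the Web API generates; the devices table and the lat/lon columns are as described above, while the builder function and its parameter names are just illustrative:

    // Hypothetical query builder on the Web API side: the bounding box
    // arrives as four doubles and becomes a plain range filter.
    def boundingBoxQuery(minLat: Double, minLon: Double,
                         maxLat: Double, maxLon: Double): String =
      s"""SELECT * FROM devices
         |WHERE lat BETWEEN $minLat AND $maxLat
         |  AND lon BETWEEN $minLon AND $maxLon""".stripMargin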

I want to see if I can improve the "lat between minLat and maxLat and lon between minLon and maxLon" filter with a spatial library. One example I looked at was GeoSpark. The issue is that the current versions of GeoSpark (and GeoSparkSQL) only work with Spark 2.3, and no supported Databricks runtime ships with that version anymore.

Any ideas what I can do?

Note: I cannot deviate from SQL at the moment.


Solution

  • GeoSpark joined the Apache Software Foundation as the Apache Sedona project, and a version supporting Spark 3.0 was released about two weeks ago, so you can use it the same way as GeoSpark.
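
    For example, with the Sedona jars attached to the cluster, the bounding-box filter can become a spatial predicate. A minimal sketch, assuming the devices table with lat/lon columns from the question; the box coordinates are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.sedona.sql.utils.SedonaSQLRegistrator

    val spark = SparkSession.builder().getOrCreate()
    // Registers the ST_* functions so they can be used from plain SQL
    SedonaSQLRegistrator.registerAll(spark)

    // Placeholder bounding box; in practice these come from the Web API
    val (minLat, minLon, maxLat, maxLon) = (37.7, -122.5, 37.9, -122.3)

    // The two BETWEEN filters become one spatial predicate. Note that
    // ST_PolygonFromEnvelope takes (minX, minY, maxX, maxY), longitude
    // first, and ST_Point expects Decimal arguments in this version.
    val result = spark.sql(s"""
      SELECT *
      FROM devices
      WHERE ST_Contains(
        ST_PolygonFromEnvelope($minLon, $minLat, $maxLon, $maxLat),
        ST_Point(CAST(lon AS Decimal(24, 20)), CAST(lat AS Decimal(24, 20))))
      """)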

    P.S. To automate registration of the functions, we can create something like the following, compile it into a jar, and then configure Spark with --conf spark.sql.extensions=...SomeExtensions:

    import org.apache.spark.sql.SparkSessionExtensions

    class SomeExtensions extends (SparkSessionExtensions => Unit) {
      def apply(e: SparkSessionExtensions): Unit = {
        // A check rule receives the SparkSession, so it can serve as a
        // hook for one-time setup such as registering SQL functions.
        e.injectCheckRule(spark => {
          // Setup something, e.g. register the Sedona functions here
          _ => ()  // the returned rule itself checks nothing
        })
      }
    }
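
    On Databricks there is no spark-submit command line to pass --conf to, but the equivalent is to attach the jar to the cluster and set spark.sql.extensions to the fully qualified class name in the cluster's Spark config, so the functions are registered before the first query arrives over JDBC.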