I'm new to the geospatial domain, and I've managed to add geomesa-spark-jts
to the project, which lets me use geospatial functions.
I need to go through millions of geocoded events (eventRdd) and, based on custom criteria, check whether they are within a certain distance of a road segment linestring (roadSegmentRdd).
Currently, for each event I have to go through the entire roadSegmentRdd and check whether the criteria are satisfied, which is not optimal at all.
How can I use GeoMesa and indexes to make this query faster? What are the minimum dependencies needed?
Typically, you would want to ingest at least your point data into a GeoMesa data store, which you could then query based on spatial predicates to efficiently filter down to the ones you are interested in.
GeoMesa has several different data store options you could use, from a fully distributed database like HBase to a lightweight file-system-based solution. The best one will depend on your performance requirements and available infrastructure. There is more information about the different data stores here, and Spark-specific details here.
Once you have the data ingested, you could try one of the join approaches outlined here or here, depending on the size of your road segment RDD.
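To illustrate why an index helps at all, here is a minimal sketch of the general idea in plain Python. It is not GeoMesa's API or its actual index structure: it uses a simple uniform grid, treats each road segment as a single two-point line, and assumes planar coordinates in the same units as the distance threshold. All names (build_grid_index, segments_near, cell_size, max_dist) are hypothetical. The point is only that each event is compared against the few segments registered in its grid cell instead of the entire road set.

```python
import math
from collections import defaultdict


def point_segment_distance(p, a, b):
    """Planar distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to the endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def cell_of(x, y, cell_size):
    """Grid cell containing the point (x, y)."""
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def build_grid_index(segments, cell_size, max_dist):
    """Map each grid cell to the ids of segments whose bounding box,
    expanded by max_dist, overlaps that cell."""
    index = defaultdict(list)
    for seg_id, (a, b) in segments.items():
        min_cx, min_cy = cell_of(min(a[0], b[0]) - max_dist,
                                 min(a[1], b[1]) - max_dist, cell_size)
        max_cx, max_cy = cell_of(max(a[0], b[0]) + max_dist,
                                 max(a[1], b[1]) + max_dist, cell_size)
        for cx in range(min_cx, max_cx + 1):
            for cy in range(min_cy, max_cy + 1):
                index[(cx, cy)].append(seg_id)
    return index


def segments_near(event, segments, index, cell_size, max_dist):
    """Ids of segments within max_dist of the event, checking only the
    candidates registered in the event's own grid cell."""
    candidates = index.get(cell_of(event[0], event[1], cell_size), [])
    return sorted(s for s in candidates
                  if point_segment_distance(event, *segments[s]) <= max_dist)


# Hypothetical sample data: two road segments and one event.
segments = {
    "main_st": ((0.0, 0.0), (10.0, 0.0)),
    "side_rd": ((50.0, 50.0), (50.0, 60.0)),
}
index = build_grid_index(segments, cell_size=5.0, max_dist=1.0)
print(segments_near((5.0, 0.5), segments, index, cell_size=5.0, max_dist=1.0))
# prints ['main_st']
```

Because each segment is registered in every cell touched by its bounding box expanded by max_dist, any event within max_dist of a segment is guaranteed to find it among its cell's candidates. Real linestrings with many vertices and longitude/latitude coordinates need a proper geometry library and a geodesic distance, which is what a GeoMesa data store and its spatial predicates handle for you.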