
Are GeoMesa indexes reliable, and do they remain in sync with the main table?


To create indexes, GeoMesa creates multiple tables in HBase. I have a few questions:

  1. What does GeoMesa do to ensure these tables are in sync?
  2. What is the impact on GeoMesa queries if the index tables are not in sync?
  3. What happens (with write calls) if GeoMesa is not able to write to one of the index tables?
  4. Is synchronization between tables best-effort, or does GeoMesa ensure availability of the data with eventual consistency?

I am planning to use GeoMesa with HBase (backed by S3) to store my geospatial data; the data size can grow from terabytes to petabytes.

I am investigating how reliable GeoMesa is in terms of synchronization between the primary table and the index tables.

HBase Tables:

catalog1
catalog1_node_id_v4 (Main Table)
catalog1_node_z2_geom_v5 (Index Table)
catalog1_node_z3_geom_lastUpdateTime_v6 (Index Table)
catalog1_node_attr_identifier_geom_lastUpdateTime_v8 (Index Table)

GeoMesa Schema

geomesa-hbase describe-schema -c catalog1 -f node

INFO Describing attributes of feature 'node'

key | String
namespace | String
identifier | String (Attribute indexed)
versionId | String
nodeId | String
latitude | Integer
longitude | Integer
lastUpdateTime | Date (Spatio-temporally indexed)
tags | Map
geom | Point (Spatio-temporally indexed) (Spatially indexed)

User data:
geomesa.index.dtg | lastUpdateTime
geomesa.indices | z3:6:3:geom:lastUpdateTime,z2:5:3:geom,id:4:3:,attr:8:3:identifier:geom:lastUpdateTime
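The `geomesa.indices` value encodes one entry per index table. A small parsing sketch (the colon-delimited layout of name, version, mode flag, then attributes is an assumption for illustration; note how the version numbers line up with the `_v4`/`_v5`/`_v6`/`_v8` suffixes of the HBase table names above):

```python
# Parse the geomesa.indices user-data string into its components.
# Assumed layout per entry: name:version:mode:attr1:attr2...
indices = ("z3:6:3:geom:lastUpdateTime,z2:5:3:geom,"
           "id:4:3:,attr:8:3:identifier:geom:lastUpdateTime")

parsed = []
for entry in indices.split(","):
    name, version, mode, *attrs = entry.split(":")
    parsed.append({"index": name,
                   "version": int(version),
                   "attributes": [a for a in attrs if a]})

for p in parsed:
    print(p["index"], p["version"], p["attributes"])
```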


Solution

  • GeoMesa does not do anything to sync indices - generally this should be taken care of in your ingest pipeline.

    If you have a reliable feature ID tied to a given input feature, then you can write that feature multiple times without causing duplicates. During ingest, if a batch of features fails due to a transient issue, then you can just re-write them to ensure that the indices are correct.
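    The idempotency argument can be shown with a toy key-value model (the `write_feature` function and the dict-backed "tables" are hypothetical, not GeoMesa API): because every index row is keyed by the same stable feature ID, re-writing a feature after a partial failure overwrites rows rather than duplicating them.

```python
# Toy model: each "index table" is a dict keyed by the feature ID.
z2_index = {}
id_index = {}

def write_feature(fid, geom, attrs):
    # Hypothetical writer; real GeoMesa writes go through a feature writer.
    # Both tables are keyed by the same stable ID, so a retry overwrites.
    id_index[fid] = (geom, attrs)
    z2_index[fid] = geom

write_feature("node-42", (12.5, 55.7), {"identifier": "abc"})
write_feature("node-42", (12.5, 55.7), {"identifier": "abc"})  # retry is a no-op

assert len(id_index) == 1 and len(z2_index) == 1  # no duplicates
```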

    For HBase, when you call flush or close on a feature writer, the pending mutations will be sent to the cluster. Once that method returns successfully, then the data has been persisted to HBase. If an exception is thrown, you should re-try the failed features. If there are subsequent HBase failures, you may need to recover write-ahead logs (WALs) as per standard HBase operation.
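    A minimal retry wrapper around the flush step might look like the following sketch (`flush_fn` stands in for the real flush/close call on a feature writer; the function name and backoff policy are assumptions):

```python
import time

def flush_with_retry(flush_fn, features, retries=3, backoff=0.5):
    """Call flush_fn(features); on a transient failure, re-try the batch.

    Safe to repeat because writes with stable feature IDs are idempotent.
    """
    for attempt in range(retries):
        try:
            flush_fn(features)
            return True
        except IOError:
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    return False  # caller must recover (e.g. dead-letter the batch)
```

    On final failure the batch is handed back to the caller rather than dropped, mirroring the advice above to re-try failed features.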

    A feature may also fail to be written due to validation (e.g. a null geometry). In this case, you would not want to re-try the feature, as it will never ingest successfully. If you are using the GeoMesa converter framework, you can pre-validate features to ensure that they will ingest cleanly.
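    Pre-validation can be a simple predicate applied before writing, separating permanent failures (never retry) from transient ones. A sketch with two illustrative checks (the field names match the schema above; the checks themselves are assumptions, and the converter framework does this more thoroughly):

```python
def validate(feature):
    """Return a list of permanent errors; such features should not be retried."""
    errors = []
    if feature.get("geom") is None:
        errors.append("null geometry")
    if feature.get("lastUpdateTime") is None:
        errors.append("missing default date field")
    return errors

good = {"geom": (0.0, 0.0), "lastUpdateTime": "2021-01-01T00:00:00Z"}
bad = {"geom": None, "lastUpdateTime": None}

assert validate(good) == []
assert validate(bad) == ["null geometry", "missing default date field"]
```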

    If you do not have an ingest pipeline already, you may want to check out geomesa-nifi, which will let you convert and validate input data, and re-try failures automatically through NiFi flows.