I am trying to test geomesa cassandra backend.
I have ingested a ~2M points from OSM and send DWITHIN
and BBOX
queries to cassandra using geomesa with geotools ecql.
Then I've done some performance tests, the results do not look reasonable for me.
Cassandra is installed to linux machine with 16 cores xeon, 32GB RAM and 1 SSD drive. I got ~150
queries per second.
I started to investigate geomesa execution plan for my queries.
Trace logs coming from org.locationtech.geomesa.index.utils.Explainer
were really helpful, they do a great job explaining what is going on.
What looks confusing to me is the number of range scans that go though cassandra.
For example, I see the following in my logs:
Table: osm_poi_a7_c_osm_5fpoi_5fa7_attr_v2
Ranges (49): SELECT * FROM ..
The number 49
means the actual number of range scans sent to cassandra.
Different queries give me different results, they vary approximately from ~10 to ~130.
10
looks quite reasonable to me but 130
looks enormous.
Could you please explain what causing geomesa to send so huge amount of range scans?
Is there any way to decrease the number of range scans?
Maybe there are some configuration options?
Are there other options? like descreasing the presicion of z-index to improve such queries?
Thanks anyway!
In general, GeoMesa uses common query planning algorithms among its various back-end implementations. The default values are tilted more towards HBase and Accumulo, which support scans with large numbers of ranges. However, there are various knobs you can use to modify the behavior.
You can reduce the number of ranges that are generated at runtime through the system property geomesa.scan.ranges.target
(see here). Note that this will be a rough upper limit, so you will generally get more ranges than specified.
When creating your simple feature type schema, you can also disable sharding, which defaults to 4. The number of ranges generated will be multiplied by the number of shards. See here and here.
If you are querying multiple 'time bins' (weeks by default), then the number of ranges will be multiplied by the number of time bins you are querying. You can set this to a longer interval when creating your schema; see here.
Thanks,