I was reading through the documentation (http://snappydatainc.github.io/snappydata/streamingWithSQL/) and had a question about this item:
"Reduced shuffling through co-partitioning: With SnappyData, the partitioning key used by the input queue (e.g., for Kafka sources), the stream processor and the underlying store can all be the same. This dramatically reduces the need to shuffle records."
Suppose we are using Kafka and partition the data in a topic using a key (a single value). Is it possible to map this single Kafka key to the multiple partition keys defined on the SnappyData table?
Is there a hash of some sort to turn multiple keys into a single key?
The benefit of reduced shuffling seems significant, and I am trying to understand the best practice here.
thanks!
With a DirectKafka stream, each partition pulls data from its own designated topic. If no partitioning is specified for the storage table, each DirectKafka partition writes only to local storage buckets, so everything lines up without any extra work. The only thing to take care of is having enough topics (and thus partitions) for good concurrency: ideally at least as many as the total number of processor cores in the cluster, so that all cores stay busy.
When storage tables are partitioned explicitly, SnappyData's store uses the same hashing as Spark's HashPartitioning (for the "PARTITION_BY" option of both column and row tables), since that is the one used at the Catalyst SQL execution layer. Execution and storage are therefore always collocated. However, aligning that with ingestion from DirectKafka partitions requires some manual work: align the Kafka topic partitioning with HashPartitioning, then make the preferred locations of each DirectKafka partition match the storage. This will be simplified in coming releases.
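
On the "hash to turn multiple keys into a single key" part of the question, here is a minimal Python sketch of the general idea: join the table's partition columns into one composite key on the producer side, then map that key to a partition index. The helper names (`composite_key`, `partition_for`) are hypothetical, and `zlib.crc32` is only a stand-in hash. It is not the hash used by Spark's HashPartitioning, so this sketch alone would not line up Kafka partitions with SnappyData buckets; for that, the producer would have to reproduce the same hashing, which is the manual work mentioned above.

```python
import zlib

SEPARATOR = "\x1f"  # ASCII unit separator, unlikely to appear in key values


def composite_key(*columns):
    """Join several partition-column values into one string key.

    Hypothetical helper: the separator prevents ("ab", "c") and
    ("a", "bc") from collapsing into the same key.
    """
    return SEPARATOR.join(str(c) for c in columns)


def partition_for(key, num_partitions):
    """Map a key to a partition index in [0, num_partitions).

    zlib.crc32 is a stand-in hash chosen for illustration only;
    it is NOT the hash Spark's HashPartitioning uses.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    # Two partition columns (for example, region and customer id) become one key.
    key = composite_key("us-east", 42)
    print(partition_for(key, 8))
```

The composite key would then be used as the single Kafka record key, so all records sharing the same column values land in the same topic partition.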