I have a use case where I need to generate a surrogate key (unique and incrementing by 1) for each record I insert into a Hive table from a Spark Streaming program. The key must never be repeated, even if the program restarts.
Based on my research, this does not seem possible to implement in Spark Streaming, because the executors run on different nodes.
Is there any way to implement this?
Spark Batch
Use RDD.zipWithIndex() to assign a sequential index to each row.
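A minimal sketch of the batch case, assuming a SparkSession and a DataFrame named df (placeholder data, hypothetical names):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("surrogate-key").getOrCreate()
import spark.implicits._
val df = Seq("a", "b", "c").toDF("value") // placeholder input

val indexed = df.rdd.zipWithIndex().map { case (row, idx) =>
  (idx + 1, row) // zipWithIndex is 0-based and contiguous, so +1 starts keys at 1
}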
Spark Streaming
At the beginning of every batch, fetch the maximum key from the last batch and run code like this:
val n = lastBatchMaxKey() // max key persisted from the previous batch
df.rdd.zipWithIndex().map { case (row, idx) =>
  val key = idx + n + 1 // continue numbering after the previous batch's max key
  (key, row)            // keep the key paired with its row
}
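For the keys to survive restarts, the maximum key has to be persisted somewhere durable at the end of every batch. One possible sketch, assuming a small Hive metadata table named key_offsets with a single max_key column (hypothetical names, with spark as the SparkSession):

// After writing the batch, persist the new maximum so the next batch
// (or a restarted job) can continue from it.
val newMax = n + df.count()
spark.sql(s"INSERT OVERWRITE TABLE key_offsets SELECT ${newMax} AS max_key")

// lastBatchMaxKey() then simply reads it back:
def lastBatchMaxKey(): Long =
  spark.sql("SELECT max_key FROM key_offsets").first().getLong(0)

Note that persisting the offset and writing the batch are two separate actions, so a failure between them can still leave a gap or duplicate; exactly-once behavior would need idempotent writes or a transactional store.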