apache-spark, spark-streaming

Surrogate key in Spark batch or Streaming


I have a use case where I need to generate a surrogate key (unique, incrementing by 1) for each record I insert into a Hive table with a Spark Streaming program. The key must never be repeated, even if the program restarts.

Based on my research, this does not seem possible to implement in Spark Streaming, since the executors run on different nodes.

Is there any way to implement this?


Solution

  • Spark Batch

    Use RDD.zipWithIndex() to assign an index to each row (a minimal batch sketch is shown after the streaming example below).

  • Spark Streaming

    1. At the end of every batch, find the max key and store it in a persistent store (a sketch of such helpers follows the code below).
    2. At the beginning of every batch, get the max key of the last batch and run code like this:

      val n = lastBatchMaxKey()                 // max key persisted by the previous batch
      df.rdd.zipWithIndex().map { case (row, idx) =>
        val key = idx + n + 1                   // indices start at 0, so new keys begin at n + 1
        (key, row)                              // keep the key together with the original row
      }
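
For the batch case, here is a minimal, self-contained sketch of zipWithIndex assigning a surrogate key. The input DataFrame, column name, and the choice to start keys at 1 are illustrative assumptions, not part of the original answer.

    import org.apache.spark.sql.SparkSession

    object BatchSurrogateKey {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("surrogate-key-batch").getOrCreate()
        import spark.implicits._

        // Example input; in practice this would be the DataFrame read from the source.
        val df = Seq("a", "b", "c").toDF("value")

        // zipWithIndex assigns a stable 0-based index to every row across all partitions.
        val withKey = df.rdd.zipWithIndex().map { case (row, idx) =>
          (idx + 1, row.getString(0))           // start the surrogate key at 1
        }.toDF("key", "value")

        withKey.show()
        spark.stop()
      }
    }

The resulting DataFrame can then be written to the Hive table with the key column included.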
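
One way to persist the max key between batches (step 1 above) is a small metadata table read and written over JDBC. The following is only a sketch: the JDBC URL, the table batch_max_key(max_key BIGINT), and the helper names lastBatchMaxKey/saveBatchMaxKey are assumptions for illustration; any durable store (HBase, ZooKeeper, a file on HDFS) would work the same way.

    import java.sql.DriverManager

    // Hypothetical connection string; replace with your own metadata database.
    val jdbcUrl = "jdbc:postgresql://host:5432/meta?user=etl&password=secret"

    // Read the highest key written by the previous batch (0 if none yet).
    def lastBatchMaxKey(): Long = {
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        val rs = conn.createStatement()
          .executeQuery("SELECT COALESCE(MAX(max_key), 0) FROM batch_max_key")
        rs.next()
        rs.getLong(1)
      } finally conn.close()
    }

    // Record the highest key produced by the current batch.
    def saveBatchMaxKey(maxKey: Long): Unit = {
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        val st = conn.prepareStatement("INSERT INTO batch_max_key (max_key) VALUES (?)")
        st.setLong(1, maxKey)
        st.executeUpdate()
      } finally conn.close()
    }

At the end of each micro-batch, compute the max of the keys just generated and pass it to saveBatchMaxKey, so the next batch resumes from that value even after a restart.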