java, apache-spark, rdd

Can not modify value in JavaRDD


I have a question about how to update JavaRDD values.

I have a JavaRDD<CostedEventMessage> whose message objects contain information about which partition of a Kafka topic they should be written to.

I'm trying to change the partitionId field of such objects using the following code:

rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));

where the repartitionEvent logic is:

private CostedEventMessage repartitionEvent(CostedEventMessage costedEventMessage, int numPartitions) {
    costedEventMessage.setPartitionId(1);
    return costedEventMessage;
}

But the modification does not happen.

Could you please advise why this happens and how to correctly modify values in a JavaRDD?


Solution

  • Spark is lazy, so from the code you pasted it is not clear whether you actually performed any action on the JavaRDD (such as collect or foreach), nor how you came to the conclusion that the data was not changed.

    For example, if you assumed that by running the following code:

    List<CostedEventMessage> messagesLst = ...;
    JavaRDD<CostedEventMessage> rddToKafka = javaSparkContext.parallelize(messagesLst);
    rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));
    

    each element in messagesLst would have its partitionId set to 1, you are wrong: the map transformation is not executed until an action runs. It would hold true only if you added, for example:

    messagesLst = rddToKafka.collect();
    

    For more details, refer to the Spark documentation on RDD operations (transformations vs. actions).
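
    The same laziness can be observed with plain Java streams, which follow the same pattern of lazy intermediate operations and eager terminal operations; this is an analogy only (no Spark required), with the class and field names chosen here purely for illustration:

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class LazyMapDemo {
        // Records which elements the map function has actually processed.
        static final List<Integer> touched = new ArrayList<>();

        public static void main(String[] args) {
            List<Integer> source = Arrays.asList(1, 2, 3);

            // Like JavaRDD.map, Stream.map is a lazy transformation:
            // defining the pipeline runs nothing.
            Stream<Integer> pipeline = source.stream().map(x -> {
                touched.add(x);
                return x * 10;
            });

            // Nothing has been processed before a terminal operation
            // (in Spark terms: before an action such as collect).
            System.out.println("processed before terminal op: " + touched.size()); // 0

            List<Integer> result = pipeline.collect(Collectors.toList());
            System.out.println("processed after collect: " + touched.size());     // 3
            System.out.println(result);                                           // [10, 20, 30]
        }
    }
    ```

    Just as here, assigning the result of rddToKafka.map(...) only builds a plan; the mapped values exist only once an action materializes them.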