Tags: java, apache-spark, bigdata, spark-streaming, hazelcast

Spark RDD filter right after Spark Streaming


I am using Spark Streaming and reading a stream from Kafka. After reading the stream, I add its records to a Hazelcast map.

The issue is that I need to filter the map's values right after reading the stream from Kafka.

I am using the code below to parallelize the map values.

// Snapshot the Hazelcast map values into a list, then parallelize them into an RDD
List<MyCompObj> list = CacheManager.getInstance().getMyMap().values().stream().collect(Collectors.toList());
JavaRDD<MyCompObj> myObjRDD = sparkContext.parallelize(list);
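
For reference, the filtering I need on top of that RDD would look roughly like the following; isValid() is a hypothetical predicate here, not an actual method of MyCompObj.

// Hypothetical filter over the parallelized map values (isValid is assumed)
JavaRDD<MyCompObj> filteredRDD = myObjRDD.filter(obj -> obj.isValid());
List<MyCompObj> filteredList = filteredRDD.collect();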

But in this logic, I end up using the JavaRDD inside JavaInputDStream.foreachRDD, and this causes serialization issues.

My first question is: how can I run my Spark job in an event-driven way?

On the other hand, I would also like some opinions on scheduled Spark jobs. What is the best practice for scheduling a Spark job so that it runs at specific time(s)?


Solution

  • I solved my issues by separating the streaming and batch processing into two parts, as it should be.

    I am using Quartz and SparkLauncher to trigger a new job (example); a sketch is shown below.
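
For illustration, here is a minimal sketch of that setup: a Quartz job whose execute method uses SparkLauncher to start a separate Spark batch application. The Spark home, jar path, main class, master, and cron expression are placeholder assumptions, not values from my actual setup.

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class SparkBatchLauncherJob implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        try {
            // Launch the batch application as its own Spark job (paths and class are assumptions)
            SparkAppHandle handle = new SparkLauncher()
                    .setSparkHome("/opt/spark")                  // assumption
                    .setAppResource("/jobs/my-batch-job.jar")    // assumption
                    .setMainClass("com.example.MyBatchJob")      // assumption
                    .setMaster("yarn")
                    .startApplication();
        } catch (Exception e) {
            throw new JobExecutionException(e);
        }
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(SparkBatchLauncherJob.class)
                .withIdentity("sparkBatchLauncherJob")
                .build();

        // Example schedule: run every day at 02:00
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("sparkBatchTrigger")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}

This keeps the streaming application running continuously while the batch part is launched as an independent Spark application on a schedule, so the two no longer share a driver or a closure.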