Tags: cassandra, spark-streaming, spark-cassandra-connector

How to set variables in the "where" clause when reading a Cassandra table with Spark Streaming?


I'm doing some statistics using Spark Streaming and Cassandra. When I read a Cassandra table with spark-cassandra-connector and turn the Cassandra row RDD into a DStream via ConstantInputDStream, the "CurrentDate" variable in the where clause stays fixed at the day the program started.

The purpose is to analyze the total score by some dimensions up to the current date, but as it stands the analysis only covers data up to the day the job started. I started the code on 2019-05-25, and data inserted into the table after that time is never picked up.

The code I use is shown below:

    class TestJob extends Serializable {

      def test(ssc: StreamingContext): Unit = {

        val readTableRdd = ssc.cassandraTable(Configurations.getInstance().keySpace1, Constants.testTable)
          .select(
            "code",
            "date",
            "time",
            "score"
          ).where("date <= ?", new Utils().getCurrentDate())

        val DStreamRdd = new ConstantInputDStream(ssc, readTableRdd)

        DStreamRdd.foreachRDD { r =>
          // DO SOMETHING
        }
      }
    }

    object GetSSC extends Serializable {
      def getSSC(): StreamingContext = {
        val conf = new SparkConf()
          .setMaster(Configurations.getInstance().sparkHost)
          .setAppName(Configurations.getInstance().appName)
          .set("spark.cassandra.connection.host", Configurations.getInstance().casHost)
          .set("spark.cleaner.ttl", "3600")
          .set("spark.default.parallelism", "3")
          .set("spark.ui.port", "5050")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        val sc = new SparkContext(conf)
        sc.setLogLevel("WARN")
        @transient lazy val ssc = new StreamingContext(sc, Seconds(30))
        ssc
      }
    }

    object Main {
      val logger: Log = LogFactory.getLog(Main.getClass)

      def main(args: Array[String]): Unit = {
        val ssc = GetSSC.getSSC()
        try {
          new TestJob().test(ssc)
          ssc.start()
          ssc.awaitTermination()
        } catch {
          case e: Exception =>
            logger.error(Main.getClass.getSimpleName + " error: ", e)
        }
      }
    }

The table used in this demo:

    CREATE TABLE test.test_table (
        code text PRIMARY KEY,  // UUID
        date text,              // '20190520'
        time text,              // '12:00:00'
        score int               // 90
    );

Any help is appreciated!


Solution

  • In general, the RDDs returned by the Spark Cassandra Connector aren't streaming RDDs - Cassandra has no built-in functionality that lets you subscribe to a feed of changes and analyze it. You can implement something similar by explicitly looping and fetching the data on every batch (see the sketch below), but it requires careful design of the tables, and it's hard to say more without digging deeper into the requirements for latency, data volume, etc.
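
One way to approximate this with the code from the question is to keep the ConstantInputDStream purely as a per-batch trigger and rebuild the Cassandra scan inside foreachRDD, so the date bound is recomputed on every interval instead of once when the streaming graph is defined. A rough, untested sketch that reuses the Utils, Configurations and Constants helpers from the question:

    import com.datastax.spark.connector._
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.ConstantInputDStream

    class TestJob extends Serializable {

      def test(ssc: StreamingContext): Unit = {
        val sc = ssc.sparkContext

        // Dummy stream used only to fire once per batch interval.
        val trigger = new ConstantInputDStream(ssc, sc.emptyRDD[Int])

        trigger.foreachRDD { _ =>
          // This closure runs on the driver for every batch, so the date bound
          // is re-evaluated and a fresh Cassandra scan is defined each time.
          val currentDate = new Utils().getCurrentDate()

          val snapshot = sc.cassandraTable(
              Configurations.getInstance().keySpace1, Constants.testTable)
            .select("code", "date", "time", "score")
            .where("date <= ?", currentDate)

          // DO SOMETHING with `snapshot`, e.g. sum the scores per code:
          // snapshot.map(r => (r.getString("code"), r.getInt("score"))).reduceByKey(_ + _)
        }
      }
    }

Note that pushing `date <= ?` down to Cassandra is only efficient if `date` is a clustering column or otherwise indexed; with the schema shown, where `code` is the sole primary key, such a range predicate would likely require ALLOW FILTERING or a rework of the primary key, which is part of what "careful design of the tables" refers to.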