java apache-spark cassandra spark-cassandra-connector

How to run multiple Spark Cassandra queries


I need to run a task like the one below, but I am missing something. I know I cannot use the JavaSparkContext like this and call javaFunctions inside the function passed to map, because that causes a serialization problem.

I need to run cartesian.size() separate Cassandra queries, one for each (p1, p2) pair. Any advice?

    JavaSparkContext jsc = new JavaSparkContext(conf);
    JavaRDD<DateTime> dateTimeJavaRDD = jsc.parallelize(dateTimes); //List<DateTime>
    JavaRDD<Integer> virtualPartitionJavaRDD = jsc.parallelize(virtualPartitions); //List<Integer>
    JavaPairRDD<DateTime, Integer> cartesian = dateTimeJavaRDD.cartesian(virtualPartitionJavaRDD);

    long c = cartesian.map(new Function<Tuple2<DateTime, Integer>, Long>() {
        @Override
        public Long call(Tuple2<DateTime, Integer> tuple2) throws Exception {
            return javaFunctions(jsc).cassandraTable("keyspace", "table").where("p1 = ? and  p2 = ?", tuple2._1(), tuple2._2()).count();
        }
    }).reduce((a,b) -> a + b);


    System.out.println("TOTAL ROW COUNT IS: " + c);

Solution

  • The correct solution is to perform a join between your data and the Cassandra table. The joinWithCassandraTable function does what you need: generate an RDD of Tuple2 that contains the values for p1 and p2, and then call joinWithCassandraTable, something like this (not tested, adapted from my example here):

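    // javaFunctions, someColumns, mapRowToTuple and mapTupleToRow are static imports
    // from com.datastax.spark.connector.japi.CassandraJavaUtil.
    // Key types follow the question's cartesian RDD: (DateTime, Integer).
    JavaRDD<Tuple2<DateTime, Integer>> trdd = cartesian.map(new Function<Tuple2<DateTime, Integer>, Tuple2<DateTime, Integer>>() {
            @Override
            public Tuple2<DateTime, Integer> call(Tuple2<DateTime, Integer> tuple2) throws Exception {
                return new Tuple2<DateTime, Integer>(tuple2._1(), tuple2._2());
            }
        });
    // Assumption: p1 and p2 form the partition key and are stored as timestamp and int.
    CassandraJavaPairRDD<Tuple2<DateTime, Integer>, Tuple2<DateTime, Integer>> joinedRDD =
         javaFunctions(trdd).joinWithCassandraTable("test", "jtest",
         someColumns("p1", "p2"), someColumns("p1", "p2"),
         mapRowToTuple(DateTime.class, Integer.class), mapTupleToRow(DateTime.class, Integer.class));
    // one joined pair per matching Cassandra row, so this is the total row count
    long c = joinedRDD.count();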
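
As for why the code in the question fails: the function passed to map captures jsc, and Spark has to serialize that function to ship it to the executors. The following is a plain-Java sketch of that effect, not Spark code. DriverOnlyContext and SerializableFunction are hypothetical stand-ins for a driver-side object such as JavaSparkContext and for Spark's serializable Function interface, and the check uses standard Java serialization.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

public class ClosureSerializationDemo {

    // Hypothetical stand-in for a driver-only object; note it is NOT Serializable.
    static class DriverOnlyContext {
        long count(String p1, int p2) {
            return 1L;
        }
    }

    // Hypothetical stand-in for a function type that must be serializable.
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    // Returns true if the object can be written with Java serialization.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The lambda captures ctx, so serializing the lambda must also serialize ctx.
    static SerializableFunction<Integer, Long> capturingContext() {
        DriverOnlyContext ctx = new DriverOnlyContext();
        return p2 -> ctx.count("some-date", p2);
    }

    // The lambda captures nothing from the enclosing scope.
    static SerializableFunction<Integer, Long> capturingNothing() {
        return p2 -> 1L;
    }

    public static void main(String[] args) {
        System.out.println("captures context: " + canSerialize(capturingContext()));
        System.out.println("captures nothing: " + canSerialize(capturingNothing()));
    }
}
```

The function that captures the context cannot be serialized, while the one that captures nothing can. That is why the join approach keeps all Cassandra access on the RDD itself instead of calling javaFunctions(jsc) from inside a task.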