java apache-spark cassandra spark-cassandra-connector

Spark java filter isin method or something else?

I have around 2 billions of rows in my cassandra database which I filter with the isin method based on an experimentlist with 4827 Strings, as shown below. However, I noticed that after the distinct command I have only 4774 unique rows. Any ideas why 53 are missing? Does the isin method has a threshold/limitations? I have double and triple checked the experimentlist, it does have 4827 Strings, and also the other 53 strings do exist in the database as I can query them with cqlsh. Any help much appreciated!

Dataset<Row> df1 = sp.read().format("org.apache.spark.sql.cassandra")
                .options(new HashMap<String, String>() {
                    {
                        put("keyspace", "mdb");
                        put("table", "experiment");
                    }
                })
                .load().select(col("experimentid")).filter(col("experimentid").isin(experimentlist.toArray()));
List<String> tmplist=df1.distinct().as(Encoders.STRING()).collectAsList();
   
System.out.println("tmplist "+tmplist.size());

Solution

Regarding the actual question about "missing data" - there could be problems when your cluster has missing writes, and repair isn't done regularly. Spark Cassandra Connector (SCC) reads data with consistency level LOCAL_ONE, and may hit nodes without all data. You can try to set consistency level to LOCAL_QUORUM (via --conf spark.cassandra.input.consistency.level=LOCAL_QUORUM), for example, and repeat the experiment, although it's better to make sure that data is repaired.

Another problem that you have is that you're using the .isin function - it's translating into a query SELECT ... FROM table WHERE partition_key IN (list). See the execution plan:

scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._
scala> val data = spark.read.cassandraFormat("m1", "test").load()
data: org.apache.spark.sql.DataFrame = [id: int, m: map<int,string>]

scala> data.filter($"id".isin(Seq(1,2,3,4):_*)).explain
== Physical Plan ==
*Scan org.apache.spark.sql.cassandra.CassandraSourceRelation [id#169,m#170] PushedFilters: [*In(id, [1,2,3,4])], ReadSchema: struct<id:int,m:map<int,string>>

This query is very inefficient, and put an additional load to the node that performs query. In the SCC 2.5.0, there are some optimizations around that, but it's better to use so-called "Direct Join" that was also introduced in the SCC 2.5.0, so SCC will perform requests to specific partition keys in parallel - that's more effective and put the less load to the nodes. You can use it as following (the only difference that I have it as "DSE Direct Join", while in OSS SCC it's printed as "Cassandra Direct Join"):

scala> val toJoin = Seq(1,2,3,4).toDF("id")
toJoin: org.apache.spark.sql.DataFrame = [id: int]

scala> val joined = toJoin.join(data, data("id") === toJoin("id"))
joined: org.apache.spark.sql.DataFrame = [id: int, id: int ... 1 more field]

scala> joined.explain
== Physical Plan ==
DSE Direct Join [id = id#189] test.m1 - Reading (id, m) Pushed {}
+- LocalTableScan [id#189]

This direct join optimization needs to be explicitly enabled as described in the documentation.