Why am I getting different count results when using the 'T' separator in a timestamp filter in Spark SQL?
FYI: the data comes from Cassandra tables, queried through DSE Spark.
DataStax version: DSE 5.1.3
Apache Cassandra™ 3.11.0.1855
Apache Spark™ 2.0.2.6
DataStax Spark Cassandra Connector 2.0.5
scala> val data = spark.sql("select * from pramod.history ").where(col("sent_date") >= "2024-06-11 00:00:00.000Z" && col("sent_date") <= "2027-11-15 00:00:00.000Z")
data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [tx_id: string, agreement_number: string ... 37 more fields]
scala> data.count()
res21: Long = 181466
scala> val data = spark.sql("select * from pramod.history ").where(col("sent_date") >= "2024-06-11T00:00:00.000Z" && col("sent_date") <= "2027-11-15T00:00:00.000Z")
data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [tx_id: string, agreement_number: string ... 37 more fields]
scala> data.count()
res22: Long = 163228
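To see how Spark is typing these literals and which filters actually get pushed down to the Cassandra source, the query plans can be printed; any cast (or lack of one) around sent_date shows up there (exact output varies by Spark/connector version):

// Print the analyzed and physical plans; compare the two variants and look
// at how the literal is typed and which filters reach the Cassandra scan node.
data.explain(true)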
I also get a different result when I use cassandraCount() compared to the Spark SQL counts:
scala> val rdd = sc.cassandraTable("pramod", "history").select("tx_id","sent_date").where("sent_date>='2024-06-11 00:00:00.000Z' and sent_date <='2027-11-15 00:00:00.000Z'")
rdd: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[77] at RDD at CassandraRDD.scala:19
scala> rdd.count()
res20: Long = 181005
scala> rdd.cassandraCount()
res25: Long = 181005
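The where() on cassandraTable is passed through to Cassandra as a CQL predicate, so on this path the comparison runs server-side on real timestamps, which would explain why count() and cassandraCount() agree with each other. If that's right, repeating the scan with the 'T' form should return the same count - an untested sketch mirroring the code above:

// CQL accepts both the space- and 'T'-separated ISO 8601 timestamp forms,
// so this variant should select the same rows as the one above.
val rddT = sc.cassandraTable("pramod", "history")
  .select("tx_id", "sent_date")
  .where("sent_date >= '2024-06-11T00:00:00.000Z' and sent_date <= '2027-11-15T00:00:00.000Z'")
rddT.count()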
I haven't tested this, so I'm not 100% sure, but it could be that Spark is treating the value as a string rather than as a timestamp - at least I have seen such behaviour with filter pushdown. Can you try something like:
data.filter("ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)")
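If the comparison really does happen on strings, the separator character takes part in it, which is enough to change which rows fall inside the range - a quick plain-Scala check:

// ' ' (0x20) sorts before digits and before 'T' (0x54), so the two
// renderings of the same instant compare unequal and order differently.
"2027-11-15 00:00:00.000Z" < "2027-11-15T00:00:00.000Z"  // true

Applied to the table above, and taking string parsing out of the picture entirely by building java.sql.Timestamp values up front (an untested sketch using the column and range from the question):

import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, lit}

// Real timestamp values, so the comparison cannot degrade to strings.
val from = Timestamp.from(java.time.Instant.parse("2024-06-11T00:00:00.000Z"))
val to   = Timestamp.from(java.time.Instant.parse("2027-11-15T00:00:00.000Z"))

val fixed = spark.sql("select * from pramod.history")
  .where(col("sent_date") >= lit(from) && col("sent_date") <= lit(to))
fixed.count()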