I'm using DB Connect 9.1.9. My cluster runs 9.1 LTS with a single node (for testing purposes).
My data is stored on S3 as a Delta table.
Running the following:
df = spark.sql("select * from <table_name> where runDate >= '2022-01-10 14:00:00' and runDate <= '2022-01-10 15:00:00'")
print(df.count())
With DB Connect I get 31; in a Databricks workspace notebook I get 462.
Of course you can't check those numbers yourself; I just want to understand why there is a difference.
If I remove the condition on runDate, I get matching results on both platforms. So I deduced that the runDate filter was at fault, but I couldn't figure out why.
The schema:
StructType(List(StructField(id,StringType,false),
StructField(runDate,TimestampType,true)))
I get the same explain plan on both platforms too.
Did I miss something about timestamp usage?
Update 1: funnily enough, when I put the count inside the query itself (spark.sql("SELECT count(*) ...")) I still get 31. It might be the way DB Connect translates the query for the cluster.
The problem was the timezone associated with the Spark session.
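A quick way to confirm this is to print the session timezone in both environments and compare; spark.sql.session.timeZone falls back to the JVM default timezone when it isn't set explicitly:

# Run this both via DB Connect and in a workspace notebook; if the two
# values differ, the timestamp literals in the query are interpreted
# differently on each side.
print(spark.conf.get("spark.sql.session.timeZone"))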
Add this right after creating your Spark session (assuming your timestamps are stored in UTC):
spark.conf.set("spark.sql.session.timeZone", "UTC")
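For context: Spark parses a timestamp literal like '2022-01-10 14:00:00' in the session timezone, so the same filter can select a different one-hour window depending on that setting. Here is a minimal local sketch illustrating this (the epoch values in the comments assume Europe/Paris is UTC+1 on that date):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Same literal, two different instants depending on the session timezone.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT unix_timestamp(timestamp '2022-01-10 14:00:00') AS epoch").show()
# epoch = 1641823200  (14:00 UTC)

spark.conf.set("spark.sql.session.timeZone", "Europe/Paris")
spark.sql("SELECT unix_timestamp(timestamp '2022-01-10 14:00:00') AS epoch").show()
# epoch = 1641819600  (13:00 UTC, i.e. 14:00 in Paris)

With a one-hour shift like this, a filter on a one-hour window can easily pick up a different set of rows, which is consistent with the 31 vs. 462 counts above.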