
Pyspark: Incorrect date when save into Cassandra


I am new to Spark and Cassandra. I prepared a data frame and saved it into Cassandra, but the date shown in Pyspark and the value stored in Cassandra are not consistent.

Pyspark shows '2020-10-04'. Cassandra shows '2020-10-03 16:00:00.000000+0000'.

Any idea how to resolve it?

In Pyspark,

>>> df_check.show()
+-----------+-----+
|access_date|count|
+-----------+-----+
| 2020-10-04|  120|
+-----------+-----+

>>> df_check.printSchema()
root
 |-- access_date: date (nullable = true)
 |-- count: long (nullable = false)

In Cassandra,

cqlsh:test> select * from test2;

 access_date                     | count
---------------------------------+-------
 2020-10-03 16:00:00.000000+0000 |   120


cqlsh:test> desc test2;

CREATE TABLE test.test2 (
    access_date timestamp PRIMARY KEY,
    count int
)

Solution

  • Cassandra stores `timestamp` values in UTC. Your Spark `date` column represents midnight in the session's local timezone (UTC+8), so when it's written into the table's `timestamp` column it gets converted to UTC, which shifts it back 8 hours to `2020-10-03 16:00:00+0000`. You can try to work around this by setting `spark.sql.session.timeZone` to `UTC`, so Spark will work with dates in that timezone, but be aware that this may change your data, especially if you're using functions like `current_date`/`current_timestamp`.
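    The 8-hour shift can be reproduced with plain Python datetimes (a sketch assuming the session timezone is UTC+8, as in the question):

    ```python
    from datetime import datetime, timezone, timedelta

    # Midnight on 2020-10-04 in a UTC+8 zone -- what Spark's `date`
    # value means when the session timezone is UTC+8.
    local_tz = timezone(timedelta(hours=8))
    local_midnight = datetime(2020, 10, 4, tzinfo=local_tz)

    # Converting that instant to UTC shifts it back 8 hours -- exactly
    # the value cqlsh displays for the `timestamp` column.
    utc_value = local_midnight.astimezone(timezone.utc)
    print(utc_value)  # 2020-10-03 16:00:00+00:00
    ```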

    P.S. You can read more about the handling of dates/timestamps in Spark in the following blog post.
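
    A minimal sketch of the workaround, assuming a working spark-cassandra-connector setup and the `test.test2` table from the question (`df_check` is the data frame shown above); this is session configuration, not a complete job:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Make Spark interpret dates/timestamps in UTC, so the instant
    # written to Cassandra's UTC-based `timestamp` column matches the
    # date Pyspark displays. Note this changes how ALL dates/timestamps
    # are parsed and shown for the whole session.
    spark.conf.set("spark.sql.session.timeZone", "UTC")

    (df_check.write
        .format("org.apache.spark.sql.cassandra")
        .options(table="test2", keyspace="test")
        .mode("append")
        .save())
    ```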