apache-spark, cassandra, cassandra-3.0, spark-cassandra-connector

Spark Cassandra Connector missing data while reading back


I am writing 3,000,000 rows with 8 columns to Cassandra using the Spark Cassandra Connector (Python), but when I read the data back I only get 50,000 rows. When I check the row count in cqlsh, it is also only 50,000. Where is my data going? Is there an issue with the Spark Cassandra Connector?

This is my Spark config:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("das_archive") \
    .config("spark.driver.memory", "25g") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config("spark.jars.packages",
            "datastax:spark-cassandra-connector:2.4.0-s_2.11") \
    .getOrCreate()

write

 df.write.format("org.apache.spark.sql.cassandra").mode('append').options(
    table='shape1', keyspace='shape_db1').save()

read

 load_options = {"table": "shape1", "keyspace": "shape_db1",
                 "spark.cassandra.input.split.size_in_mb": "1000",
                 "spark.cassandra.input.consistency.level": "ALL"}
 data_frame = spark.read.format("org.apache.spark.sql.cassandra").options(
     **load_options).load()

Solution

  • The most probable cause is that you don't have a correct primary key; as a result, the data is overwritten. You need to make sure that every row of the input data is uniquely identified by its set of primary key columns (a quick way to check this is shown in the sketch after this answer).

    P.S. If you're just writing data that is stored in something like CSV, you can look at a tool like DSBulk, which is heavily optimized for loading/unloading data to/from Cassandra.
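
    A minimal PySpark sketch (not part of the original answer) for checking whether the input rows actually have unique keys before writing; pk_cols is a hypothetical placeholder for the real primary-key columns of shape1:

     pk_cols = ["id"]  # hypothetical: replace with the actual primary-key columns of shape1

     total_rows = df.count()
     distinct_keys = df.select(*pk_cols).distinct().count()
     print(total_rows, distinct_keys)
     # If distinct_keys is close to 50000 while total_rows is 3000000, most rows
     # share the same key, so each write overwrites the previous row with that key.

    If the distinct-key count matches the row count you read back, the table schema (not the connector) is what needs to change.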