Tags: pyspark, cassandra, cql

How do I verify in CQL that all rows were successfully copied from a CSV to a Cassandra table? SELECT statements are not returning all results


I am trying to understand Cassandra by playing with a public dataset. I inserted 1.5M rows from a CSV into a table on my local Cassandra instance, created WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }.
The table was created with one column as the partition key and one more as a clustering column, together forming the primary key.

I got confirmation that all 1.5M rows were processed: COPY completed.

But when I run SELECT or SELECT COUNT(*) on the table, I always get at most 182 rows. Secondly, the number of records returned when querying on the clustering column seems higher than when querying on a single column, which does not make sense to me. What am I missing about Cassandra's architecture and query model?
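The behavior described above follows from Cassandra's upsert semantics: an INSERT with an existing primary key silently overwrites the previous row. A minimal Python sketch (with made-up rows, not the actual dataset) models the table as a dict keyed by the primary key:

```python
# Sketch only: Cassandra INSERT is an upsert, so rows sharing the same
# primary key overwrite each other. A dict keyed by (state, severity)
# models a table whose primary key is exactly those two columns.
rows = [
    ("CA", 2, "accident A"),
    ("CA", 2, "accident B"),   # same (state, severity) -> overwrites A
    ("CA", 3, "accident C"),
    ("TX", 2, "accident D"),
]

table = {}
for state, severity, details in rows:
    table[(state, severity)] = details   # upsert: last write wins

print(len(rows))            # 4 rows inserted
print(len(table))           # 3 rows survive, one per distinct key
print(table[("CA", 2)])     # accident B (the last write)
```

Scaled up, inserting 1.5M rows into a table keyed only on (state, severity) would leave exactly one row per distinct (state, severity) pair.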

Lastly, I have also tried reading the same Cassandra table from the pyspark shell, and it reads 182 rows too.


Solution

  • Your primary key is PRIMARY KEY (state, severity). With this primary key definition, all accidents in the same state with the same severity overwrite each other, so you probably have only 182 distinct (state, severity) combinations in your dataset.

    You could add another clustering column that uniquely identifies each accident, such as an accident_id.

    This blog highlights the importance of the primary key, and has some examples: https://www.datastax.com/blog/2016/02/most-important-thing-know-cassandra-data-modeling-primary-key
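A hedged sketch of that fix in CQL. The table name (accidents), column names, and types here are assumptions for illustration, not the asker's actual schema:

```sql
-- Assumed names/types for illustration only.
-- Adding accident_id as a clustering column makes every row unique,
-- so inserts no longer overwrite one another:
CREATE TABLE accidents (
    state       text,
    severity    int,
    accident_id text,
    -- ... remaining columns from the CSV ...
    PRIMARY KEY ((state), severity, accident_id)
);

-- After re-running COPY, the count should match the CSV row count:
SELECT COUNT(*) FROM accidents;
```

Note that (state) remains the sole partition key; severity and accident_id are clustering columns, so rows within a state are stored sorted by severity, then accident_id.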