
Cassandra query table without partition key


I am trying to extract data from a table as part of a migration job.

The schema is as follows:

CREATE TABLE IF NOT EXISTS ${keyspace}.entries (
    username text,
    entry_type int,
    entry_id text,
    PRIMARY KEY ((username, entry_type), entry_id)
);

To query the table we need the partition key, which is the first part of the primary key. Hence, if we know both the username and the entry_type, we can query the table.

In this case the username can be anything, but the entry_type is an integer in the range 0-9.

When doing the extraction we query the table 10 times for every username, to make sure we try all possible values of entry_type.
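The iteration described above can be sketched roughly like this (the helper name and the usernames are hypothetical; a real job would bind each key into a prepared statement with a driver session):

```python
# Sketch of the extraction loop: for each known username, try all ten
# entry_type values, since the composite partition key (username, entry_type)
# must be fully specified in the WHERE clause.
QUERY = "SELECT * FROM entries WHERE username = %s AND entry_type = %s"

def partition_keys(usernames, entry_types=range(10)):
    """Yield every (username, entry_type) partition key to query."""
    for username in usernames:
        for entry_type in entry_types:
            yield (username, entry_type)

# With a real driver session (e.g. the DataStax cassandra-driver) each key
# would be executed against the table:
#   for key in partition_keys(known_usernames):
#       rows = session.execute(QUERY, key)
keys = list(partition_keys(["alice", "bob"]))
```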

We can no longer find any entries, as we have exhausted our list of usernames. But nodetool tablestats reports that there is still data left in the table, gigabytes even, so we assume the table is not empty.

But I cannot find a way to inspect the table to figure out what usernames remain in it. If I could inspect it, I could add the remaining usernames to our extraction job and eventually drain the table. But I cannot simply query the table like this:

SELECT * FROM ${keyspace}.entries LIMIT 1

as Cassandra requires the partition key to make meaningful queries.

What can I do to figure out what is left in our table?


Solution

  • To answer your first question, I would like to shed more light on the gc_grace_seconds property.

    In Cassandra, data isn’t deleted in the same way it is in an RDBMS. Cassandra is designed for high write throughput and avoids reads-before-writes. So in Cassandra a delete is actually an update, and updates are actually inserts: a “tombstone” marker is written to indicate that the data is now logically deleted (also known as a soft delete). Records marked with a tombstone must later be removed to reclaim the storage space, which is done by a process called compaction. But remember that tombstones are eligible for physical deletion / garbage collection only after a specific number of seconds, known as gc_grace_seconds. This is a very good blog post to read for more detail: https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

    So possibly you are looking at the table size before gc_grace_seconds has elapsed, and the deleted data is still there.

    Coming to your second issue, where you want to fetch some samples from the table without providing partition keys: you can analyze your table content using Spark. The Spark Cassandra Connector allows you to create Spark applications that analyze database data. You can follow these articles / documentation to write a quick, handy Spark application to analyze Cassandra data.

    https://www.instaclustr.com/support/documentation/cassandra-add-ons/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/

    https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkJavaApi.html
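    As a rough illustration of that approach, the sketch below uses the Spark Cassandra Connector from pyspark to scan the whole table and list the distinct usernames still present, with no partition key needed. The host, keyspace, and the pure helper are assumptions for the example; the actual read requires pyspark and the connector package on your cluster, so that part is shown in comments.

```python
# Hypothetical sketch: full-table scan via the Spark Cassandra Connector to
# find which usernames remain. Keyspace/table names are placeholders.
def cassandra_read_options(keyspace, table):
    """Build the options dict for spark.read with the connector's format."""
    return {"keyspace": keyspace, "table": table}

# With pyspark and the Spark Cassandra Connector available:
#   from pyspark.sql import SparkSession
#   spark = (SparkSession.builder
#            .config("spark.cassandra.connection.host", "127.0.0.1")
#            .getOrCreate())
#   df = (spark.read.format("org.apache.spark.sql.cassandra")
#         .options(**cassandra_read_options("my_keyspace", "entries"))
#         .load())
#   df.select("username").distinct().show()
opts = cassandra_read_options("my_keyspace", "entries")
```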

    I would recommend not deleting records while you do the migration. Rather, first complete the migration, and after that do a quick validation / verification to ensure all records were migrated successfully (this can easily be done with Spark by comparing DataFrames from the old and new tables). After successful verification, truncate the old table, as truncate does not create tombstones and is therefore more efficient. Note that a huge number of tombstones is not good for cluster health.