Can someone explain the behavior of the following query, and point me to the documentation that covers it?
select * from <keyspace.table>
Let's assume I have a 5-node cluster. How does the DataStax Cassandra driver behave when such a query is issued? (The fetch size was set to 500.)
Is this a proper way to pull data? Does it cause any performance issues?
No, that's a really bad way to pull data. Cassandra shines when it fetches data by at least the partition key, which identifies the servers that hold the actual data. When you do select * from table, the request is sent to a coordinator node, which has to pull the data from all servers and send it back through itself. That overloads the coordinator and will most probably lead to timeouts if you have enough data in the cluster.
If you really need to perform a full fetch of the data from the cluster, it's better to use something like the Spark Cassandra Connector, which reads data by token ranges, fetching it directly from the nodes that hold the data, and does this in parallel. You can of course implement the token range scan with the Java driver, something like this, but it will require more work on your side compared to using Spark.
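
To give a rough idea of what a token range scan involves, here is a minimal sketch of just the splitting step, assuming the cluster uses the default Murmur3Partitioner (tokens are signed 64-bit values). The class TokenRangeSplitter, its methods, and the ks.tbl / id names are hypothetical and only for illustration. The sketch cuts the ring into equal slices and builds one token() query per slice; it does not talk to the cluster, and it does not route each query to a replica that owns the slice, which a real implementation would do using the token ranges reported by the driver's cluster metadata.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits the Murmur3 token ring into contiguous slices.
final class TokenRangeSplitter {
    private static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    // Number of distinct signed 64-bit values: 2^64.
    private static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(64);

    // Returns `parts` ranges as {startInclusive, endInclusive} pairs that
    // together cover Long.MIN_VALUE..Long.MAX_VALUE with no gaps or overlaps.
    static List<long[]> split(int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        List<long[]> ranges = new ArrayList<>(parts);
        BigInteger n = BigInteger.valueOf(parts);
        for (int i = 0; i < parts; i++) {
            // BigInteger avoids overflow when computing the slice boundaries.
            BigInteger start = MIN.add(RING_SIZE.multiply(BigInteger.valueOf(i)).divide(n));
            BigInteger nextStart = MIN.add(RING_SIZE.multiply(BigInteger.valueOf(i + 1L)).divide(n));
            ranges.add(new long[] {
                start.longValueExact(),
                nextStart.subtract(BigInteger.ONE).longValueExact()
            });
        }
        return ranges;
    }

    // Builds the CQL text for one slice; table and partition key names are
    // supplied by the caller (for a composite key, pass "col1, col2").
    static String queryFor(String table, String partitionKey, long[] range) {
        return String.format(Locale.ROOT,
            "SELECT * FROM %s WHERE token(%s) >= %d AND token(%s) <= %d",
            table, partitionKey, range[0], partitionKey, range[1]);
    }
}
```

Each string returned by TokenRangeSplitter.queryFor("ks.tbl", "id", range) can then be executed as an ordinary paged query, and the slices can be run in parallel from several threads. Using more slices than nodes keeps each individual query small.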