Using the spark-elasticsearch connector it is possible to load only the required columns directly from ES into Spark. However, there doesn't seem to be such a straightforward option to do the same using the spark-cassandra connector.
Reading data from ES into Spark -- here only the required columns are brought from ES into Spark:
spark.conf.set('es.nodes', ",".join(ES_CLUSTER))
es_epf_df = spark.read.format("org.elasticsearch.spark.sql") \
.option("es.read.field.include", "id_,employee_name") \
.load("employee_0001") \
Reading data from Cassandra into Spark -- here all columns are brought into Spark and then a select is applied to pull the columns of interest:
spark.conf.set('spark.cassandra.connection.host', ','.join(CASSANDRA_CLUSTER))
cass_epf_df = spark.read.format('org.apache.spark.sql.cassandra') \
.options(keyspace="db_0001", table="employee") \
.load() \
.select("id_", "employee_name")
Is it possible to do the same for Cassandra? If yes, how? If not, why not?
Actually, the connector should do that by itself, without the need to set anything explicitly. It's called "predicate pushdown" (together with column pruning for the selected columns), and the Cassandra connector does it, according to the documentation:
The connector will automatically pushdown all valid predicates to Cassandra. The Datasource will also automatically only select columns from Cassandra which are required to complete the query. This can be monitored with the explain command.
source: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
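You can check this yourself with explain() on the same DataFrame as in the question. A minimal sketch, assuming the same keyspace and table names used above and a placeholder CASSANDRA_CLUSTER value:

from pyspark.sql import SparkSession

CASSANDRA_CLUSTER = ["127.0.0.1"]  # placeholder, use your own contact points

spark = SparkSession.builder \
    .appName("cassandra-column-pruning-check") \
    .config("spark.cassandra.connection.host", ",".join(CASSANDRA_CLUSTER)) \
    .getOrCreate()

cass_epf_df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace="db_0001", table="employee") \
    .load() \
    .select("id_", "employee_name")

# Print the logical and physical plans; the Cassandra scan node should list
# only id_ and employee_name as its output columns, i.e. only those two
# columns are requested from Cassandra. Any filters added with .where(...)
# appear as pushed predicates where the connector supports them.
cass_epf_df.explain(True)

So even though there is no explicit option like es.read.field.include, the .select() after .load() should end up reading only those columns from Cassandra.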