apache-spark, parquet

Is Spark able to read only the column values satisfying some condition from a Parquet file?


I have the following code:

val count = spark.read.parquet("data.parquet").select("foo").where("foo > 3").count

I'd like to know whether Spark can somehow push the filter down and read from the Parquet file only the values that satisfy the where condition. Can we avoid a full scan in this case?


Solution

  • Short answer: yes in this case, but not in all cases. Parquet stores min/max statistics for each row group, so Spark can skip entire row groups whose statistics show that no value can match the pushed-down filter.

    You can run .explain on the query and look for PushedFilters in the FileScan node of the physical plan to see for yourself.

    This is an excellent reference document, freely available on the Internet, from which I learnt a few things in the past: https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example
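To make the row-group skipping idea concrete, here is a minimal, hypothetical Scala sketch (plain Scala, not actual Spark or Parquet internals) that simulates how per-row-group min/max statistics let a reader answer `foo > 3` without decompressing every row group. The `RowGroup` class and `countGreaterThan` function are illustrative names, not a real API:

```scala
// Hypothetical model of a Parquet row group: raw values plus the
// min/max statistics Parquet stores in the file footer.
case class RowGroup(values: Seq[Int]) {
  val min: Int = values.min
  val max: Int = values.max
}

// Count values > threshold, skipping any row group whose max statistic
// proves it cannot contain a matching value. Returns the count and the
// number of row groups actually scanned.
def countGreaterThan(groups: Seq[RowGroup], threshold: Int): (Long, Int) = {
  var groupsScanned = 0
  val count = groups.collect {
    case g if g.max > threshold => // statistics say a match is possible
      groupsScanned += 1
      g.values.count(_ > threshold).toLong
  }.sum
  (count, groupsScanned)
}

val groups = Seq(
  RowGroup(Seq(1, 2, 3)), // max = 3, skipped for "foo > 3"
  RowGroup(Seq(2, 5, 7)), // max = 7, must be scanned
  RowGroup(Seq(0, 1))     // max = 1, skipped
)
val (count, scanned) = countGreaterThan(groups, 3)
// count == 2 (values 5 and 7), scanned == 1 of 3 row groups
```

This is the same effect you see in practice when the physical plan shows the filter under PushedFilters: whether anything is actually skipped still depends on how the data is laid out across row groups.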