I have the following code:
```scala
val count = spark.read.parquet("data.parquet").select("foo").where("foo > 3").count
```
I'm interested in whether Spark is able to push the filter down and read from the Parquet file only the values satisfying the `where` condition. Can we avoid a full scan in this case?
The short answer is yes in this case, but not in all cases. Spark pushes the predicate down to the Parquet reader, which can use the per-row-group min/max statistics stored in the file footer to skip row groups that cannot contain values matching `foo > 3`. Filters that Spark cannot translate into Parquet predicates (for example, ones involving a UDF) are not pushed down and force a full read.
You can call `.explain` on the query and see for yourself: predicates that were pushed down show up under `PushedFilters` in the Parquet scan node of the physical plan.
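For example, here is a minimal sketch (assuming a local SparkSession and that `data.parquet` has an integer column `foo`, as in your question):

```scala
import org.apache.spark.sql.SparkSession

object PushdownCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-pushdown-check")
      .master("local[*]")  // assumption: running locally for illustration
      .getOrCreate()

    val query = spark.read.parquet("data.parquet")
      .select("foo")
      .where("foo > 3")

    // Print the plans; in the physical plan, look at the Parquet scan node.
    // If the filter was pushed down, you should see something like:
    //   PushedFilters: [IsNotNull(foo), GreaterThan(foo,3)]
    query.explain(true)

    spark.stop()
  }
}
```

If `PushedFilters` is empty for your predicate, Spark will read the data and apply the filter itself instead of letting the Parquet reader skip row groups.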
This excellent reference, freely available online, is one I learned a few things from in the past: https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example