
How Spark SQL reads Parquet partitioned files


I have a Parquet file of around 1 GB. Each record is a reading from an IoT device, capturing the energy consumed by the device in the last minute. Schema: houseId, deviceId, energy. The Parquet file is partitioned on houseId and deviceId, and a file contains the data for the last 24 hours only.
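
For context, a file laid out this way is typically written with partitionBy, which creates houseId=.../deviceId=.../ subdirectories on disk. A minimal sketch of producing such a file (the source path "/raw-readings.json" and the session setup are assumptions, not part of the question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession ss = SparkSession.builder().appName("readings").getOrCreate();

// Hypothetical source; any Dataset<Row> with houseId, deviceId and energy columns works.
Dataset<Row> readings = ss.read().json("/raw-readings.json");
readings.write()
        .partitionBy("houseId", "deviceId")  // creates houseId=.../deviceId=.../ directories
        .parquet("/readings.parquet");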

I want to execute some queries on the data residing in this Parquet file using Spark SQL. An example query finds the average energy consumed per device for a given house over the last 24 hours.

Dataset<Row> df4 = ss.read().parquet("/readings.parquet");
// encoder is an Encoder<T> defined elsewhere; registerTempTable is the
// older (now deprecated) name for createOrReplaceTempView.
df4.as(encoder).registerTempTable("deviceReadings");
ss.sql("Select avg(energy) from deviceReadings where houseId = 3123").show();

The above code works well. I want to understand how Spark executes this query.

  1. Does Spark read the whole Parquet file into memory from HDFS without looking at the query? (I don't believe this to be the case.)
  2. Does Spark load only the required partitions from HDFS as per the query?
  3. What if there are multiple queries that need to be executed? Will Spark look at multiple queries while preparing an execution plan? One query may work with just one partition whereas the second may need all the partitions, so a consolidated plan would load the whole file from disk into memory (if memory limits allow it).
  4. Will it make a difference in execution time if I cache the df4 dataframe above?

Solution

  • Does Spark read the whole Parquet file into memory from HDFS without looking at the query?

    It shouldn't scan all the data files, but it might, in general, access the metadata of all files.

  • Does Spark load only the required partitions from HDFS as per the query?

    Yes, it does.
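
    One way to verify the pruning is to inspect the physical plan. With a predicate on a partition column, the FileScan node lists the pushed-down partition predicates under PartitionFilters rather than scanning everything; a sketch (the comment shows the approximate shape of the output):

    ss.sql("select avg(energy) from deviceReadings where houseId = 3123").explain();
    // The FileScan parquet node should show something like
    //   PartitionFilters: [isnotnull(houseId), (houseId = 3123)]
    // i.e. only the houseId=3123 directories are read.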

  • What if there are multiple queries that need to be executed? Will Spark look at multiple queries while preparing an execution plan?

    It does not. Each query has its own execution plan.
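
    You can observe this directly: each sql() call is analyzed, optimized, and planned on its own, so explaining two different queries yields two independent plans (a sketch):

    // The second query's full scan has no effect on how the first,
    // partition-pruned query is planned.
    ss.sql("select avg(energy) from deviceReadings where houseId = 3123").explain();
    ss.sql("select avg(energy) from deviceReadings").explain();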

  • Will it make a difference in execution time if I cache the df4 dataframe above?

    Yes, at least for now, it will make a difference; see Caching dataframes while keeping partitions.
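
    A minimal caching sketch, assuming the whole dataset fits in memory (createOrReplaceTempView is the current replacement for registerTempTable):

    Dataset<Row> df4 = ss.read().parquet("/readings.parquet");
    df4.cache();  // lazy: nothing is cached until an action touches the data
    df4.createOrReplaceTempView("deviceReadings");

    // The first action reads from HDFS and populates the in-memory columnar
    // cache; subsequent queries can then be served from memory.
    ss.sql("select avg(energy) from deviceReadings where houseId = 3123").show();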