I am using Spark to read data from an S3 bucket; below is my code.
Dataset<Row> dataset = spark.read().parquet("s3a://my_bucket/backup/backup/year=2023/month=12/day=20/hour=0");
The code above runs fast, but the code below takes a very long time. Does the code below load the whole dataset first? Please guide me.
Dataset<Row> dataset = spark.read().parquet("s3a://my_bucket/backup/backup/");
dataset.createOrReplaceTempView("my_data");
dataset = spark.sql("select * from my_data where year=2023 and month=12 and day=20 and hour=0");
dataset.show();
Your second query does not load the whole dataset first, but it does need to discover the files you are about to load later. If you have many files and partitions, this file-listing step can take a considerable amount of time.
You can speed up this step by using glob pattern-matching characters in the path so that Spark lists only the objects you need (instead of every object under your base path). Here you can find the list of supported characters: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus-org.apache.hadoop.fs.Path-
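For example, a sketch using the bucket layout from your question: the glob below restricts listing to a single day's partitions, and the `basePath` option tells Spark where the partition columns (`year`, `month`, `day`, `hour`) start so they stay in the schema. (The class and app names here are just placeholders; the paths are assumed to match your bucket.)

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GlobReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("glob-read-example")
                .getOrCreate();

        // The glob matches only hour=* under one specific day, so the file
        // index lists a handful of directories instead of the whole table.
        Dataset<Row> dataset = spark.read()
                // basePath makes Spark treat year/month/day/hour as
                // partition columns even though we read below them.
                .option("basePath", "s3a://my_bucket/backup/backup/")
                .parquet("s3a://my_bucket/backup/backup/year=2023/month=12/day=20/hour=*");

        dataset.show();

        spark.stop();
    }
}
```

With this approach you no longer need the `where year=2023 and month=12 and day=20` filter for pruning, because only those directories were listed in the first place.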