Tags: apache-spark, pyspark, databricks, spark-structured-streaming

How to load a large file in pySpark and then process it efficiently?


I have a large file in storage that I want to load and process in Databricks (PySpark). Because the file is large, loading it all at once and only then processing it would be inefficient, so I would like to read it in parts and process each part while the next one is still loading. How can I read the file in parts? One idea I had was structured streaming, but there too the whole file ends up in a single batch. How can I get it loaded in multiple batches?
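For context, here is a minimal sketch of the kind of streaming read described above; the directory path, file format, and schema are assumptions, not taken from the question. It shows why the attempt still produces one batch: `maxFilesPerTrigger` limits how many files enter each micro-batch, not how much of a single file does.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema and input directory, for illustration only.
schema = StructType([StructField("value", StringType())])

stream_df = (
    spark.readStream
    .format("csv")
    .schema(schema)
    # Caps the number of *files* per micro-batch; a single large file
    # is therefore still consumed as one batch.
    .option("maxFilesPerTrigger", 1)
    .load("/mnt/data/large_file_dir/")
)
```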


Solution

  • You cannot have Spark avoid scanning the entire dataset if it all sits in a single file.

    While reading the data, Spark will split it into partitions based on the configuration spark.sql.files.maxPartitionBytes, which defaults to 128MB. The data is then processed in parallel, determined by the resulting number of partitions and the number of cores available in your Spark cluster (see the sketch after this list).
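A minimal sketch of tuning this setting, assuming a splittable format such as uncompressed CSV and a hypothetical path (`/mnt/data/large_file.csv`); the 64 MB value is just an example of lowering the split size to get more partitions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Lower the maximum bytes per partition from the 128 MB default to 64 MB,
# so a single large file is cut into more (smaller) partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))

df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .load("/mnt/data/large_file.csv")  # hypothetical path
)

# Inspect how many partitions the read produced; Spark processes one
# partition per available core at a time.
print(df.rdd.getNumPartitions())
```

Note that this only helps with splittable formats; a gzip-compressed file, for instance, cannot be split and will land in a single partition regardless of this setting.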