Tags: pyspark, optimization, databricks, azure-databricks, parquet

Is reading all the .parquet partitions slower than reading the full .parquet at once? (Databricks)


I generate and save a dataset daily, partitioned by the field 'Hotel', and I was wondering whether reading a partitioned file is slower than reading the same file without partitions.

Let's say that my main parquet file (without partitioning) is:

  • main_bookings.parquet

Then I also have, as I said, a partitioned version of the same dataset. For example, its partitions are:

  • main_bookings_partitioned.parquet/Hotel=Barcelona_1
  • main_bookings_partitioned.parquet/Hotel=Barcelona_2
  • main_bookings_partitioned.parquet/Hotel=Barcelona_3
  • main_bookings_partitioned.parquet/Hotel=Madrid_1
  • main_bookings_partitioned.parquet/Hotel=...

I was wondering whether, from Databricks and using PySpark, reading main_bookings.parquet (without partitions) is much faster than reading main_bookings_partitioned (the same data but with partitions).

Thank you in advance,

I would like to get an answer and to understand why this happens.


Solution

  • Usually, partitioning can improve the performance of reading and writing data in Spark, especially when dealing with large datasets.

    When you partition a dataset, you divide it into smaller, more manageable parts based on the values of one or more columns. This allows Spark to read or write only the relevant partitions, rather than scanning the entire dataset.
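    For example, here is a minimal sketch of partition pruning, reusing the partitioned path from later in this answer and an illustrative Hotel value: filtering on the partition column means Spark only lists and scans the matching Hotel directory.

    # Only the Hotel=Barcelona_1 directory is scanned, thanks to partition pruning
    barcelona_1 = (
        spark.read.parquet("/FileStore/tables/_01_dilip/new1_main_bookings_partitioned.parquet")
        .filter("Hotel = 'Barcelona_1'")
    )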

    However, when you partition data, a large number of small files can end up being created per partition, depending on the number of tasks and the size of the data, and this can hurt read performance.
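    To check whether a partition directory has accumulated many small files, you can list it from the Databricks notebook (a sketch; the path is the partitioned output used later in this answer):

    # List the files inside a single Hotel directory and print their sizes;
    # lots of tiny files here is a sign the DataFrame should be repartitioned before writing
    for f in dbutils.fs.ls("/FileStore/tables/_01_dilip/new1_main_bookings_partitioned.parquet/Hotel=Barcelona_1"):
        print(f.name, f.size, "bytes")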

    To control the number of output files (and therefore their size), you can use the repartition function before writing.

    # Reshuffle into 4 partitions so the subsequent write produces at most 4 files
    # (per partition directory when partitionBy is used) instead of many small ones
    df_partitioned_optimized = df_partitioned.repartition(4)
    

    To optimize the file size:

    df.write.option("compression", "snappy").parquet("/FileStore/tables/_01_dilip/new2_main_bookings_partitioned.parquet")
    
    # Cap the number of records written to each output file
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)
    

    The configuration spark.sql.files.maxRecordsPerFile controls the maximum number of records written to each output file; it caps file size in terms of record count rather than bytes.
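    Putting these pieces together, here is a sketch of a single write that combines repartitioning, the record cap set above, and compression (the path is taken from this answer; partitionBy is added to match the Hotel layout from the question):

    # With spark.sql.files.maxRecordsPerFile set above, every output file is capped
    # at 1,000,000 records, and repartition(4) limits how many files each directory gets
    (
        df.repartition(4)
          .write
          .mode("overwrite")
          .option("compression", "snappy")
          .partitionBy("Hotel")
          .parquet("/FileStore/tables/_01_dilip/new2_main_bookings_partitioned.parquet")
    )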

    import time

    # In a Databricks notebook `spark` is already defined, so no SparkSession setup is needed.
    # Note: spark.read.parquet is lazy; it lists files and reads parquet footers for the schema,
    # but it does not scan the data, so these timings measure metadata access only.
    start_time_unpartitioned = time.time()
    df_unpartitioned = spark.read.parquet("/FileStore/tables/_01_dilip/main_bookings.parquet")
    end_time_unpartitioned = time.time()
    time_taken_unpartitioned = end_time_unpartitioned - start_time_unpartitioned

    start_time_partitioned = time.time()
    df_partitioned = spark.read.parquet("/FileStore/tables/_01_dilip/new1_main_bookings_partitioned.parquet")
    end_time_partitioned = time.time()
    time_taken_partitioned = end_time_partitioned - start_time_partitioned

    print("Time taken for reading the unpartitioned file:", time_taken_unpartitioned, "seconds")
    print("Time taken for reading the partitioned file:", time_taken_partitioned, "seconds")
    

    Results:

    Time taken for reading the unpartitioned file: 0.2050321102142334 seconds
    Time taken for reading the partitioned file: 0.1596813201904297 seconds
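
    Keep in mind that the timings above mostly measure file listing and schema inference, because spark.read.parquet does not scan the data until an action runs. For a fairer comparison you can force a full read with count(), and also filter on the partition column to see pruning pay off (a sketch reusing the paths above):

    import time

    # Full scan of the unpartitioned file
    start = time.time()
    spark.read.parquet("/FileStore/tables/_01_dilip/main_bookings.parquet").count()
    print("Full scan, unpartitioned:", time.time() - start, "seconds")

    # Full scan of the partitioned file
    start = time.time()
    spark.read.parquet("/FileStore/tables/_01_dilip/new1_main_bookings_partitioned.parquet").count()
    print("Full scan, partitioned:", time.time() - start, "seconds")

    # Read a single hotel: partition pruning means only Hotel=Barcelona_1 is scanned
    start = time.time()
    (
        spark.read.parquet("/FileStore/tables/_01_dilip/new1_main_bookings_partitioned.parquet")
        .filter("Hotel = 'Barcelona_1'")
        .count()
    )
    print("Single partition, pruned:", time.time() - start, "seconds")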