I'm trying to understand what is causing the huge difference in reading speed. I have a dataframe with 30 million rows and 38 columns.
final_df = spark.read.parquet("/dbfs/FileStore/path/to/file.parquet")
This takes 14 minutes to read the file, while
final_df = spark.read.format("parquet").load("/dbfs/FileStore/path/to/file.parquet")
takes only 2 seconds.
spark.read.parquet(filename) and spark.read.format("parquet").load(filename) do exactly the same thing.
We can see this in the source code (taking Spark 3.3.2, the latest version at the time of this post):
/**
 * Loads a Parquet file, returning the result as a `DataFrame`.
 *
 * Parquet-specific option(s) for reading Parquet files can be found in
 * <a href=
 * "https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option">
 * Data Source Option</a> in the version you use.
 *
 * @since 1.4.0
 */
@scala.annotation.varargs
def parquet(paths: String*): DataFrame = {
  format("parquet").load(paths: _*)
}
We see that calling spark.read.parquet(filename) is simply an alias for spark.read.format("parquet").load(filename).
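You can convince yourself with a quick check. This sketch (assuming a live SparkSession named spark, as in a Databricks notebook, and the path from your question) shows that both calls produce the same physical plan:

df1 = spark.read.parquet("/dbfs/FileStore/path/to/file.parquet")
df2 = spark.read.format("parquet").load("/dbfs/FileStore/path/to/file.parquet")

df1.explain()  # prints a plan with a single FileScan parquet node
df2.explain()  # prints an identical plan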
Those two methods of reading in a file are exactly the same. Also, reading in a file with these methods is a lazy transformation, not an action: at that point Spark only lists the files and reads the Parquet metadata to infer the schema; the data itself is not loaded until an action forces it. Only 2 seconds to read in a dataset of 30 million rows and 38 columns is extremely quick, probably too quick (depending on your hardware) for a full read to have taken place.
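A minimal sketch of how you could verify this laziness yourself (again assuming a live spark session; the path is the hypothetical one from your question):

import time

t0 = time.time()
df = spark.read.parquet("/dbfs/FileStore/path/to/file.parquet")  # lazy: only schema/metadata work happens here
print(f"read returned after {time.time() - t0:.1f}s")

t0 = time.time()
n = df.count()  # action: this triggers the actual scan of the data
print(f"count() over {n} rows took {time.time() - t0:.1f}s")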
So in your case, one of the following might have happened: