I'm trying to understand what is causing the huge difference in reading speed. I have a dataframe with 30 million rows and 38 columns.
final_df = spark.read.parquet("/dbfs/FileStore/path/to/file.parquet")
This takes 14 minutes to read the file, while
final_df = spark.read.format("parquet").load("/dbfs/FileStore/path/to/file.parquet")
takes only 2 seconds.
spark.read.parquet(filename) and spark.read.format("parquet").load(filename) do exactly the same thing.
We can see this in the source code (taking Spark 3.3.2, the latest version at the time of this post):
/**
 * Loads a Parquet file, returning the result as a `DataFrame`.
 *
 * Parquet-specific option(s) for reading Parquet files can be found in
 * <a href=
 * "https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option">
 * Data Source Option</a> in the version you use.
 *
 * @since 1.4.0
 */
@scala.annotation.varargs
def parquet(paths: String*): DataFrame = {
  format("parquet").load(paths: _*)
}
We see that calling spark.read.parquet(filename) is simply an alias for spark.read.format("parquet").load(filename).
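You can convince yourself with a quick check. This sketch (assuming a live SparkSession named spark, as in a Databricks notebook, and the path from your question) shows that both calls produce the same physical plan:

df1 = spark.read.parquet("/dbfs/FileStore/path/to/file.parquet")
df2 = spark.read.format("parquet").load("/dbfs/FileStore/path/to/file.parquet")

df1.explain()  # prints a plan with a single FileScan parquet node
df2.explain()  # prints an identical plan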
Those two methods of reading in a file are exactly the same. Also, reading in a file with these methods is a lazy transformation, not an action: at that point Spark only lists the files and reads the Parquet metadata to infer the schema; the data itself is not loaded until an action forces it. Only 2 seconds to read in a dataset of 30 million rows and 38 columns is extremely quick, probably too quick (depending on your hardware) for a full read to have taken place.
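A minimal sketch of how you could verify this laziness yourself (again assuming a live spark session; the path is the hypothetical one from your question):

import time

t0 = time.time()
df = spark.read.parquet("/dbfs/FileStore/path/to/file.parquet")  # lazy: only schema/metadata work happens here
print(f"read returned after {time.time() - t0:.1f}s")

t0 = time.time()
n = df.count()  # action: this triggers the actual scan of the data
print(f"count() over {n} rows took {time.time() - t0:.1f}s")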
So in your case, one of the following might have happened: