I am becoming familiar with the basics of Python, Spark, and PySpark by following this Spark By Examples tutorial (among others at the same site). At the very outset, they provide three ways to read in the same file:
spark.read.csv("/tmp/resources/zipcodes.csv")
spark.read.format("csv") \
.load("/tmp/resources/zipcodes.csv")
spark.read.format("org.apache.spark.sql.csv") \
.load("/tmp/resources/zipcodes.csv")
Here, spark is an object of class pyspark.sql.session.SparkSession. The lesson says that the 2nd and 3rd commands are alternatives to the 1st, but with the "fully qualified data source name". Unfortunately, the doc strings in PySpark are extremely spartan. Fully qualified paths are used in all three examples, however, so the explanation for the spark.read.format commands seems very incomplete.
What are the differences between the method calls? It seems odd to me that a whole new dedicated csv method is needed to deal specifically with CSV -- unless it is just a wrapper for the format method with CSV-specific conveniences.
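For example, if csv really is just a wrapper, I would expect these two calls to be interchangeable (the header and inferSchema options are only my own illustration, not something from the tutorial):

# Dedicated csv method with keyword options
df_a = spark.read.csv("/tmp/resources/zipcodes.csv", header=True, inferSchema=True)

# Generic format/load with the same options set via .option()
df_b = (spark.read.format("csv")
        .option("header", True)
        .option("inferSchema", True)
        .load("/tmp/resources/zipcodes.csv"))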
Context: I am using Spark 3.4.1. The version was chosen by Anaconda's package manager on Windows 10 based on a Python 3.9 environment.
What I have found
One fulsome page I found is this SaturnCloud page, but I'm puzzled by the explanation that the format method is more generic and slower. I can't see that being the case if the csv method is a wrapper -- unless the ingestor is set up in a highly suboptimal fashion with lots of control flow on a per-record, per-field, or per-character basis.
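If the format route genuinely were slower, I would expect a crude timing check to show it. Something along these lines (my own sketch, using the tutorial's file and count() to force the actual read) is what I have in mind:

import time

def time_read(make_df, repeats=5):
    # Best-of-N wall-clock time; count() forces Spark to actually scan the file
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        make_df().count()
        best = min(best, time.perf_counter() - start)
    return best

t_csv = time_read(lambda: spark.read.csv("/tmp/resources/zipcodes.csv"))
t_fmt = time_read(lambda: spark.read.format("csv").load("/tmp/resources/zipcodes.csv"))
print(f"csv(): {t_csv:.3f}s   format('csv').load(): {t_fmt:.3f}s")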
The same site also refers to the csv method as a "shorthand" for format("csv"). This suggests that it doesn't even provide any additional functionality that a wrapper might, and it shouldn't be any slower at all. So the site is self-contradictory.
This page refers to the csv method as a "shortcut" for format("csv"). Again, this gives a sense that it is a thin wrapper, but that is not consistent with SaturnCloud's indication that there could be performance differences, nor with Spark By Examples's implication that they are for different forms of the data source name.
The question as to the differences has been posed as a Stack Overflow comment before.
Let's have a look at the source code to uncover this mystery! I'm assuming you're on Spark v3.5.0, the latest at the time of writing this post.
If we have a look at DataFrameReader.scala's csv method, we see the following:
@scala.annotation.varargs
def csv(paths: String*): DataFrame = format("csv").load(paths : _*)
This shows us that indeed, spark.read.csv() and spark.read.format("csv").load() do exactly the same thing. There should be no difference in performance.
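You can check this yourself in a pyspark shell by reading the same file both ways and comparing the results (random_file.csv is just a stand-in for whatever CSV you have at hand):

# Both go through the exact same code path, so schema and row count should match
df_a = spark.read.csv("random_file.csv")
df_b = spark.read.format("csv").load("random_file.csv")
assert df_a.schema == df_b.schema
assert df_a.count() == df_b.count()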
What about spark.read.format("org.apache.spark.sql.csv").load()?
I had never seen this before, so I decided to try it out in a pyspark shell:
>>> df = spark.read.format("org.apache.spark.sql.csv").load("random_file.csv")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:587)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:675)
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.csv.DefaultSource
...
This does not work; I get a ClassNotFoundException. The traceback makes the reason clear: Spark ends up looking for a class called org.apache.spark.sql.csv.DefaultSource, which does not exist.
After some digging, I found this Map in the source code, which essentially maps fully qualified data source names to their shorthand equivalents (the shorthands being what we have been using until now). The essential line is this one:
"com.databricks.spark.csv" -> csv,
So then I tried that in a Spark shell:
>>> df1 = spark.read.format("com.databricks.spark.csv").load("random_file.csv")
and that worked!!
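As a quick sanity check that this legacy name really maps to the same CSV reader, the schema it infers should match the one from the shorthand call:

df2 = spark.read.csv("random_file.csv")
assert df1.schema == df2.schema  # same data source under the hood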
So, to summarize:
- spark.read.csv() and spark.read.format("csv").load() do exactly the same thing, the former being a very thin wrapper around the latter.
- org.apache.spark.sql.csv is not the correct fully qualified data source name for CSV files: it is com.databricks.spark.csv.