
Differences in the 3 ways to read CSV into Spark DataFrame?


I am becoming familiar with the basics of Python, Spark, and PySpark by following this Spark By Examples tutorial (among others at the same site). At the very outset, they provide three ways to read in the same file:

spark.read.csv("/tmp/resources/zipcodes.csv")
spark.read.format("csv") \
    .load("/tmp/resources/zipcodes.csv")
spark.read.format("org.apache.spark.sql.csv") \
    .load("/tmp/resources/zipcodes.csv")

Here, spark is an object of class pyspark.sql.session.SparkSession. The lesson says that the 2nd and 3rd commands are alternatives to the 1st, to be used with a "fully qualified data source name". Unfortunately, the doc strings in PySpark are extremely spartan. All three examples use fully qualified file paths, however, so the explanation of the spark.read.format commands seems very incomplete.

What are the differences between the method calls? It seems odd to me that a whole new dedicated csv method is needed to deal specifically with CSV -- unless it is just a wrapper for the format method with CSV-specific conveniences.
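
For concreteness, here are the two spellings side by side, with a couple of reader options added purely for illustration (the options are mine, not the tutorial's):

# If csv() is just a thin wrapper, these two calls should accept the
# same options and behave identically.
df_a = spark.read.csv("/tmp/resources/zipcodes.csv", header=True, inferSchema=True)
df_b = (spark.read.format("csv")
        .option("header", True)
        .option("inferSchema", True)
        .load("/tmp/resources/zipcodes.csv"))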

Context: I am using Spark 3.4.1. The version was chosen by Anaconda's package manager on Windows 10 based on a Python 3.9 environment.

What I have found

One detailed page I found is this SaturnCloud page, but I'm puzzled by its explanation that the format method is more generic and slower. I can't see that being the case if the csv method is a wrapper -- unless the ingestor is set up in a highly suboptimal fashion, with lots of control flow on a per-record, per-field, or per-character basis.

The same site also refers to the csv method as a "shorthand" for format("csv"). This suggests that it doesn't even provide any additional functionality that a wrapper might, and that it shouldn't be any slower at all. So the site contradicts itself.

This page refers to the csv method as a "shortcut" for format("csv"). Again, this gives the sense that it is a thin wrapper, but that is not consistent with SaturnCloud's indication that there could be performance differences, nor with Spark By Examples' implication that the two forms take different kinds of data source name.

The question as to the differences has been posed as a Stack Overflow comment before.


Solution

  • Let's have a look at the source code to uncover this mystery! I'm assuming you're on Spark v3.5.0, the latest at the time of writing this post.

    If we have a look at DataFrameReader.scala's csv method, we see the following:

    @scala.annotation.varargs
    def csv(paths: String*): DataFrame = format("csv").load(paths : _*)
    

    This shows us that indeed, spark.read.csv() and spark.read.format("csv").load() do exactly the same thing. There should be no difference in performance.
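
    One way to sanity-check this from a pyspark shell (a minimal sketch; random_file.csv stands in for any readable CSV file):

    df1 = spark.read.csv("random_file.csv")
    df2 = spark.read.format("csv").load("random_file.csv")
    # Both routes go through the same DataFrameReader machinery, so the
    # inferred schemas and the physical plans come out identical.
    assert df1.schema == df2.schema
    df1.explain()
    df2.explain()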

    What about spark.read.format("org.apache.spark.sql.csv").load()?

    I had never seen this before, so I decided to try it out in a pyspark shell:

    >>> df = spark.read.format("org.apache.spark.sql.csv").load("random_file.csv")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
        ...
            at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:587)
            at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:675)
        ...
    Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.csv.DefaultSource
        ...
    

    This does not work; I get a ClassNotFoundException.

    After some digging, I found this Map in the source code, which essentially maps fully qualified data source names to the shorthands we have been using so far. The essential line is this one:

    "com.databricks.spark.csv" -> csv,
    

    So then I tried that in the pyspark shell:

    >>> df1 = spark.read.format("com.databricks.spark.csv").load("random_file.csv")
    

    and that worked!!
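
    As a quick check that the legacy name really resolves to the built-in CSV source (a sketch, reusing the same assumed random_file.csv):

    df_builtin = spark.read.csv("random_file.csv")
    # The backward-compatibility map sends com.databricks.spark.csv to the
    # built-in CSV source, so df1 from above should match exactly.
    assert df1.schema == df_builtin.schema
    assert df1.count() == df_builtin.count()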

    Conclusion

    • spark.read.csv() and spark.read.format("csv").load() do exactly the same thing, the former being a very thin wrapper around the latter
    • org.apache.spark.sql.csv is not the correct fully qualified data source name for CSV files: it is com.databricks.spark.csv