Tags: csv, apache-spark, pyspark

Functional behaviour of Spark's DataFrameReader.csv method?


I am spinning up on Python, Spark, and PySpark. I am following this tutorial on reading all CSV files in a specified directory, i.e., spark.read.csv(DirectoryPath). I tried to find a description of which files are read, e.g., all files regardless of extension or only files with a .csv extension. I consider it a very basic expectation for the documentation to describe this behaviour. The spark.read attribute returns a DataFrameReader object, of which csv is a method.
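For reference, here is a minimal sketch of the pattern I am following (the directory path is hypothetical, not from the tutorial):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-directory-read").getOrCreate()

    # Pass a directory rather than a single file; the docs do not say
    # whether non-CSV files in that directory are skipped or parsed as CSV.
    df = spark.read.csv("C:/data/input_dir", header=True)
    df.show()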

The docstring shown by entering spark.read.csv? at the Spyder IPython console only seems to describe the case where a CSV file path is provided, not where a directory path is given. The PySpark documentation says essentially the same thing.

Since Spark is written in Scala, I sometimes find the Scala documentation more complete. However, none of the descriptions of the three csv overloads seems to cover the case where the path argument is a directory rather than a file.

What is the functional description of DataFrameReader.csv when the path argument is a directory?

I am using Spark 3.4.1, which the Anaconda package manager on Windows 10 selected as compatible with a Python 3.9 environment.


Solution

  • Spark data sources share a lot of common behaviour, which is why the documentation is split into generic and source-specific parts. The functional description of the CSV data source is here; the Scala docs of one of the .csv() overloads link to that page. By default, when given a directory, the source tries to load each and every file in it as a CSV file. You can change that behaviour by setting a glob filter with .option("pathGlobFilter", "*.csv"), as in the sketch after this bullet. Path globbing is a generic trait shared by the file-based data sources and as such is described in one of the generic sections of the SQL data sources guide.
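A minimal sketch of applying the glob filter when reading a directory (the directory path and the header option are illustrative assumptions, not from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-glob-filter").getOrCreate()

    # Only files matching the glob are loaded as CSV; everything else
    # in the directory is ignored. The path below is hypothetical.
    df = (
        spark.read
        .option("header", "true")
        .option("pathGlobFilter", "*.csv")  # restrict to .csv files
        .csv("C:/data/input_dir")
    )
    df.show()

Because pathGlobFilter is handled by the generic file-source machinery, the same option works for the other file-based sources (JSON, Parquet, ORC, text) as well.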