I am spinning up on Python, Spark, and PySpark. I am following this tutorial on reading all CSV files in a specified directory, i.e., spark.read.csv(DirectoryPath). I tried to find a description of which files are read, e.g., all files regardless of extension or only files with a .csv extension. I consider it a very basic expectation for the documentation to describe this behaviour. The spark.read property returns a DataFrameReader object, of which csv is a method.
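
For reference, here is a minimal sketch of what I am running; the directory path is just a placeholder for my actual data directory:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-directory-read").getOrCreate()

# Read everything in the directory; it is unclear to me from the docs
# whether non-.csv files in this directory are also picked up.
df = spark.read.csv("C:/data/input_dir", header=True, inferSchema=True)
df.printSchema()
```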
The docstring from entering spark.read.csv? at the Spyder IPython console only seems to describe the case where a CSV file path is provided, not where a directory path is given. The PySpark documentation says essentially the same thing.
Since Spark is written in Scala, I have sometimes found the Scala documentation better. However, none of the descriptions for the three csv methods seem to describe the case where the path argument is a directory rather than a file.
What is the functional description of DataFrameReader.csv?
I am using Spark 3.4.1, which the Anaconda package manager on Windows 10 determined to be compatible with a Python 3.9 environment.
Spark data sources share a lot of common behaviour, which is why the documentation is split into generic and source-specific parts. The functional description of the CSV data source is here; there is a link to this page in the Scala docs of one of the .csv() overloads. By default, if given a directory, the source will try to load each and every file in it as a CSV file, regardless of extension. You can change that behaviour by setting a glob filter with .option("pathGlobFilter", "*.csv"). Path globbing is a generic trait shared by the file-based data sources, and as such it is described in one of the generic sections of the SQL data sources guide.
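
For illustration, here is a minimal sketch, assuming a placeholder directory C:/data/input_dir; DataFrame.inputFiles() is one convenient way to see which files actually backed the resulting DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-glob-demo").getOrCreate()

# Default: every file under the directory is parsed as CSV, whatever its extension.
df_all = spark.read.csv("C:/data/input_dir", header=True)

# With a glob filter: only files whose names match *.csv are considered.
df_csv_only = (
    spark.read
    .option("pathGlobFilter", "*.csv")
    .csv("C:/data/input_dir", header=True)
)

# Lists the files that were actually read, which shows the effect of the filter.
print(df_all.inputFiles())
print(df_csv_only.inputFiles())
```

With the glob filter set, files such as a stray notes.txt sitting in the same directory are skipped instead of being parsed as CSV.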