I'm trying to read a file as a DataFrame in a Synapse Spark notebook using spark.read.format('csv').load('/path/to/file'). The file is located in an ADLS Gen2 account that is mounted on all Spark pools in the Synapse workspace (ADLS mounted with a workspace scope). The mounted ADLS is accessible through the filesystem APIs like any other file in the cluster filesystem. I double-checked, and I can see the file using os.listdir('/path/to/file/directory'). However, I get the following error when I try to read the file using the Spark API:
Path does not exist: abfss://<container>@<storage account>.dfs.core.windows.net/path/to/file
Given the error message, it looks like spark.read.format().load() is trying to access the ADLS directly instead of going through the path where the ADLS is mounted. Is there a way for spark.read.format().load() to read a file from the filesystem instead of using the abfss path?
Edits: the ADLS is mounted with a workspace scope.

When you already have a mount point for ADLS Gen2, access it using the format mentioned in this document:
/synfs/{jobId}/<mount_point_name>/{path_to_file}
To find jobId, run the code below.
mssparkutils.env.getJobId()
You can also get the path using the command below.
mssparkutils.fs.getMountPath(<mount_point_name>)
Next, provide this path to load the data. With the Spark read API, write the path with the synfs:/ scheme, as in the example below.
Example:
%%pyspark
df = spark.read.load("synfs:/49/test/myFile.csv", format='csv')
df.show()
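If you start from the local-style path (/synfs/{jobId}/<mount_point_name>), the conversion to the synfs:/ form used in the example can be sketched as below. This is only a rough helper: to_synfs_uri is a hypothetical name, not part of mssparkutils, and it assumes the mount path has the /synfs/... shape shown above.

```python
import posixpath


def to_synfs_uri(local_mount_path: str, relative_path: str) -> str:
    """Turn a local-style mount path into a synfs:/ URI for spark.read.

    Hypothetical helper (not part of mssparkutils). Assumes the mount path
    looks like /synfs/{jobId}/<mount_point_name>.
    """
    prefix = "/synfs/"
    if not local_mount_path.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}")
    # Drop the leading /synfs/ and any trailing slash, keep {jobId}/<mount>.
    tail = local_mount_path[len(prefix):].strip("/")
    return "synfs:/" + posixpath.join(tail, relative_path.lstrip("/"))


print(to_synfs_uri("/synfs/49/test", "myFile.csv"))
```

In a notebook, the returned string would then be passed to spark.read.load(..., format='csv') in place of the hard-coded path.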
This solution also works with ADLS Gen2 mounted with a workspace scope (like the one in the question). If that is the case, replace {jobId} with workspace in the previous paths. For example:
df = spark.read.load("synfs:/workspace/{container name}/{path to file}", format='csv')
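To avoid hard-coding either variant, both path shapes can be built by one small function. This is a minimal sketch: synfs_uri and its parameters are hypothetical names introduced here, and it only assembles the string following the two patterns above (job-scoped with {jobId}, workspace-scoped with workspace).

```python
from typing import Optional, Union


def synfs_uri(mount_point: str, relative_path: str,
              job_id: Optional[Union[int, str]] = None) -> str:
    """Build a synfs:/ URI for a mounted ADLS Gen2 path.

    Hypothetical helper. With job_id set, the mount is treated as
    job-scoped; without it, the literal segment "workspace" is used.
    """
    scope = "workspace" if job_id is None else str(job_id)
    parts = [scope, mount_point.strip("/"), relative_path.strip("/")]
    # Skip empty segments so a blank relative_path yields the mount root.
    return "synfs:/" + "/".join(p for p in parts if p)


print(synfs_uri("test", "myFile.csv", job_id=49))
print(synfs_uri("mycontainer", "path/to/file.csv"))
```

In a Synapse notebook, job_id would typically come from mssparkutils.env.getJobId(), and the result would go to spark.read.load(..., format='csv').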