Tags: apache-spark, azure-synapse, azure-synapse-analytics

Read file from filesystem using Spark in Synapse


I'm trying to read a file as a DataFrame in a Synapse Spark notebook using spark.read.format('csv').load('/path/to/file'). The file lives in an ADLS Gen2 account that is mounted on all Spark pools in the Synapse workspace (the mount has workspace scope). The mounted ADLS is accessible through the filesystem APIs like any other file on the cluster filesystem; I double-checked, and I can see the file using os.listdir('/path/to/file/directory'). However, I get the following error when I try to read the file using the Spark API:

Path does not exist: abfss://<container>@<storage account>.dfs.core.windows.net/path/to/file

Given the error message, it looks like spark.read.format().load() is trying to access the ADLS directly, instead of going to the path where the ADLS is mounted. Is there a way for spark.read.format().load() to read a file in the filesystem, instead of using the abfss path?

Edits:

  • Clarify the ADLS is mounted with a workspace scope.

Solution

  • When you already have a mount point to ADLS Gen2, access it through the synfs scheme, using the following path format:

    /synfs/{jobId}/{mount_point_name}/{path_to_file}
    

    To find the jobId, run the code below.

    mssparkutils.env.getJobId()
    

    You can also get the local mount path with the command below (pass the mount point name as a string).

    mssparkutils.fs.getMountPath("<mount_point_name>")
    

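Outside a Synapse notebook, the path assembly itself is ordinary string formatting. A minimal sketch with hypothetical values; in a real notebook the job id would come from mssparkutils.env.getJobId(), and the mount point name is whatever was used when the container was mounted:

```python
# Hypothetical values for illustration; in a Synapse notebook the job id
# comes from mssparkutils.env.getJobId() and "test" is the mount point name.
job_id = "49"
mount_point = "test"
path_to_file = "myFile.csv"

# Assemble the synfs path following /synfs/{jobId}/{mount_point_name}/{path_to_file}
synfs_path = f"/synfs/{job_id}/{mount_point}/{path_to_file}"
print(synfs_path)  # /synfs/49/test/myFile.csv
```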
    Next, provide this path to load the data.

    Example:

    %%pyspark 
    
    df = spark.read.load("synfs:/49/test/myFile.csv", format='csv') 
    df.show()
    

    This solution also works with ADLS Gen2 mounted with workspace scope (as in the question). In that case, replace {jobId} with workspace in the previous paths and include the container name. For example:

    df = spark.read.load("synfs:/workspace/{container name}/{path to file}", format='csv')