Tags: apache-spark, azure-synapse, azure-synapse-analytics

Read file from filesystem using Spark in Synapse


I'm trying to read a file as a DataFrame in a Synapse Spark notebook using spark.read.format('csv').load('/path/to/file'). The file lives in an ADLS Gen2 account that is mounted on all Spark pools in the Synapse workspace (the mount has workspace scope). The mounted ADLS is accessible through the filesystem APIs like any other file on the cluster filesystem; I double-checked, and I can see the file using os.listdir('/path/to/file/directory'). However, I get the following error when I try to read the file using the Spark API:

Path does not exist: abfss://<container>@<storage account>.dfs.core.windows.net/path/to/file

Given the error message, it looks like spark.read.format().load() is trying to access the ADLS directly, instead of going to the path where the ADLS is mounted. Is there a way for spark.read.format().load() to read a file in the filesystem, instead of using the abfss path?

Edits:

  • Clarify the ADLS is mounted with a workspace scope.

Solution

  • When you already have a mount point to ADLS Gen2, access it through the synfs scheme, using the following path format:

    /synfs/{jobId}/{mount_point_name}/{path_to_file}
    

    To find the jobId, run the code below.

    mssparkutils.env.getJobId()
    

    You can also get the local mount path with the command below (pass the mount point name as a string).

    mssparkutils.fs.getMountPath("<mount_point_name>")
    

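Outside a Synapse notebook, the path assembly itself is ordinary string formatting. A minimal sketch with hypothetical values; in a real notebook the job id would come from mssparkutils.env.getJobId(), and the mount point name is whatever was used when the container was mounted:

```python
# Hypothetical values for illustration; in a Synapse notebook the job id
# comes from mssparkutils.env.getJobId() and "test" is the mount point name.
job_id = "49"
mount_point = "test"
path_to_file = "myFile.csv"

# Assemble the synfs path following /synfs/{jobId}/{mount_point_name}/{path_to_file}
synfs_path = f"/synfs/{job_id}/{mount_point}/{path_to_file}"
print(synfs_path)  # /synfs/49/test/myFile.csv
```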
    Next, provide this path to load the data.

    Example:

    %%pyspark 
    
    df = spark.read.load("synfs:/49/test/myFile.csv", format='csv') 
    df.show()
    

    This solution also works with ADLS Gen2 mounted with workspace scope (as in the question). In that case, replace {jobId} with workspace in the previous paths and include the container name. For example:

    df = spark.read.load("synfs:/workspace/{container name}/{path to file}", format='csv')