
When should you use a mount point in Azure Synapse Analytics?


The Azure Synapse Analytics documentation mentions two ways to read/write data in Azure Data Lake Storage Gen2 using an Apache Spark pool in Synapse Analytics.

  1. Reading the files directly using the ADLS store path
adls_path = "abfss://<containername>@<accountname>.dfs.core.windows.net/<filepath>"

df = spark.read.format("csv").load(adls_path)

  2. Creating a mount point using mssparkutils and reading the files using the synfs path
mssparkutils.fs.mount( 
    "abfss://<containername>@<accountname>.dfs.core.windows.net", 
    "/data", 
    {"linkedService":"<accountname>"} 
) 

synfs_path = "synfs:/<jobid>/data/<filepath>"

df = spark.read.format("csv").load(synfs_path) 
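Rather than hardcoding `<jobid>`, the synfs path can be assembled from its parts. The helper below is a minimal sketch of that string construction; the function name, mount name `/data`, and sample job id are assumptions for illustration (in a real Synapse notebook the job id is available via `mssparkutils.env.getJobId()`).

```python
def build_synfs_path(job_id: str, mount_name: str, file_path: str) -> str:
    """Format a synfs URI for a file under a Synapse mount point.

    synfs paths follow the shape synfs:/<jobid>/<mountname>/<filepath>.
    """
    return f"synfs:/{job_id}/{mount_name.strip('/')}/{file_path.lstrip('/')}"

# Hypothetical values for illustration only:
print(build_synfs_path("12345", "/data", "raw/sales.csv"))
# -> synfs:/12345/data/raw/sales.csv
```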

What is the difference between the two methods? When should you prefer to use a mount point?


Solution

  • A mount point is like creating a virtual folder that maps a location in Azure Storage

    Pros of accessing Storage from a mount point:

    1. Less complex code when accessing specific files from the Data Lake; no need to specify the full storage path every time you access them
    2. You can access files as if they were in local storage
    3. You can keep your data organized in folders in a centralized location

    Cons:

    1. Less efficient when you need to access multiple directories in Azure Storage; mapping many directories quickly becomes confusing and hard to manage