
When should you use a mount point in Azure Synapse Analytics?


The Azure Synapse Analytics documentation mentions two ways to read/write data in Azure Data Lake Storage Gen2 using an Apache Spark pool in Synapse Analytics.

  1. Reading the files directly using the ADLS store path
adls_path = "abfss://<containername>@<accountname>.dfs.core.windows.net/<filepath>"

df = spark.read.format("csv").load(adls_path)

  2. Creating a mount point using mssparkutils and reading the files using the synfs path
mssparkutils.fs.mount( 
    "abfss://<containername>@<accountname>.dfs.core.windows.net", 
    "/data", 
    {"linkedService":"<accountname>"} 
) 

synfs_path = "synfs:/<jobid>/data/<filepath>"

df = spark.read.format("csv").load(synfs_path) 
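Rather than hardcoding `<jobid>`, the synfs path can be assembled from its parts. The helper below is a minimal sketch of that string construction; the function name, mount name `/data`, and sample job id are assumptions for illustration (in a real Synapse notebook the job id is available via `mssparkutils.env.getJobId()`).

```python
def build_synfs_path(job_id: str, mount_name: str, file_path: str) -> str:
    """Format a synfs URI for a file under a Synapse mount point.

    synfs paths follow the shape synfs:/<jobid>/<mountname>/<filepath>.
    """
    return f"synfs:/{job_id}/{mount_name.strip('/')}/{file_path.lstrip('/')}"

# Hypothetical values for illustration only:
print(build_synfs_path("12345", "/data", "raw/sales.csv"))
# -> synfs:/12345/data/raw/sales.csv
```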

What is the difference between the two methods? When should you prefer to use a mount point?


Solution

  • A mount point is like creating a virtual folder that maps a location in Azure Storage

    Pros of accessing Storage from a mount point:

    1. Less complex code when accessing specific files from the Data Lake; no need to specify the full storage path every time you access them
    2. You can access files as if they were in local storage
    3. You can keep your data organized in folders in a centralized location

    Cons:

    1. Less efficient when you need to access multiple directories in Azure Storage; mapping many directories quickly becomes confusing and hard to manage