Tags: scala, azure-data-lake, azure-databricks, azure-data-lake-gen2

Azure Data Lake Store Gen2: read files from Databricks using a Scala Spark library


I'm trying to deploy a Scala library on Azure Databricks (not a notebook) to perform some calculations. The job needs to read some Avro files from an Azure Data Lake Storage Gen2 directory, apply some operations, and then write the result back as Avro to another directory.

I'm following this guide.

My understanding is that I need to mount the Azure Data Lake directory so that I can read the Avro files directly from there, so I need to do something like this:

dbutils.fs.mount(
  source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

My problem is that I don't know how to import that "dbutils" object into my project. I'm also using the Azure Storage Java SDK (version 12.0.0-preview.6) to retrieve the files, but basically I don't know how to do the same from Databricks.

Any help or hint would be greatly appreciated.


Solution

  • The Azure Storage Java SDK is not necessary if you are going to mount the directory using dbutils (or vice versa).

    The dbutils mount can be used to mount the storage account once, so afterwards you can just use the /mnt path.

    You can find dbutils in the following library dependency:

    libraryDependencies += "com.databricks" % "dbutils-api_2.11" % "0.0.4"
    

    More info at: https://docs.databricks.com/dev-tools/databricks-utils.html#databricks-utilities-api-library
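
    Below is a minimal sketch of how the pieces could fit together in a library (not a notebook), assuming a service-principal (OAuth) mount as in the guide. The AvroJob object name, the /input and /output paths, and the <application-id>, <client-secret> and <tenant-id> placeholders are illustrative; on Databricks Runtime a SparkSession and the Avro data source are already available, and dbutils is resolved at runtime via DBUtilsHolder from the dbutils-api dependency above.

    import com.databricks.dbutils_v1.DBUtilsHolder.dbutils
    import org.apache.spark.sql.SparkSession

    object AvroJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().getOrCreate()

        // Service-principal credentials -- replace the placeholders with your own values
        val configs = Map(
          "fs.azure.account.auth.type" -> "OAuth",
          "fs.azure.account.oauth.provider.type" ->
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          "fs.azure.account.oauth2.client.id" -> "<application-id>",
          "fs.azure.account.oauth2.client.secret" -> "<client-secret>",
          "fs.azure.account.oauth2.client.endpoint" ->
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
        )

        // Mount the container once; later runs can skip this if /mnt/<mount-name> already exists
        dbutils.fs.mount(
          source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
          mountPoint = "/mnt/<mount-name>",
          extraConfigs = configs
        )

        // Read the Avro input through the mount, transform it, and write the Avro output
        val input = spark.read.format("avro").load("/mnt/<mount-name>/input")
        val output = input // ...your transformations go here...
        output.write.format("avro").save("/mnt/<mount-name>/output")
      }
    }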

    You can always use the abfss:// path directly as well (see the sketch below), so mounting is not strictly necessary.
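
    For the direct approach, one possible sketch is to set the ABFS OAuth options on the Spark session and point spark.read straight at the abfss:// URI; the same placeholders apply and are yours to fill in:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // OAuth settings for the ABFS driver, scoped to this storage account
    val account = "<storage-account-name>.dfs.core.windows.net"
    spark.conf.set(s"fs.azure.account.auth.type.$account", "OAuth")
    spark.conf.set(s"fs.azure.account.oauth.provider.type.$account",
      "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(s"fs.azure.account.oauth2.client.id.$account", "<application-id>")
    spark.conf.set(s"fs.azure.account.oauth2.client.secret.$account", "<client-secret>")
    spark.conf.set(s"fs.azure.account.oauth2.client.endpoint.$account",
      "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

    // Read and write Avro directly against the abfss:// path, no mount involved
    val df = spark.read.format("avro")
      .load(s"abfss://<file-system-name>@$account/input")
    df.write.format("avro")
      .save(s"abfss://<file-system-name>@$account/output")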