azure, databricks, azure-databricks

Difference between DBFS and Databricks Volumes


What is the difference between DBFS and volumes?

Do volumes belong to DBFS, or where exactly do they sit in terms of architecture?

I want to understand the position of volumes as well as their advantages compared to DBFS.

Can anyone help me, please?


Solution

  • What is the difference between DBFS and volumes?

    This is sort of a duplicate of "DBFS AZURE Databricks -difference in filestore and DBFS".

    Assuming you mean volumes as described in https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-volumes?wt.mc_id=MVP_323223, both DBFS and volumes deal with data storage in Databricks, but they are used in different contexts. DBFS is mainly an interface for interacting with cloud object storage, while volumes are Unity Catalog objects that let you access, store, govern, and organize files in a cloud object storage location.

    DBFS and volumes in Databricks serve different purposes and have different functionalities:

    DBFS (Databricks File System):

    • DBFS is a distributed file system mounted into a Databricks workspace and available on Databricks clusters.
    • It is an abstraction on top of scalable object storage that maps Unix-like filesystem calls to native cloud storage API calls.
    • DBFS provides convenience by mapping cloud object storage URIs to relative paths. This allows you to interact with object storage using directory and file semantics instead of cloud-specific API commands.
    • DBFS allows you to mount cloud object storage locations so that you can map storage credentials to paths in the Databricks workspace.
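
To make the path mapping above concrete, here is a minimal sketch in plain Python. It assumes the common convention that a `dbfs:/...` URI is also reachable through local file APIs under `/dbfs/...` on a cluster; the helper name `dbfs_uri_to_local_path` and the example path are hypothetical and not part of any Databricks API.

```python
from pathlib import PurePosixPath

DBFS_SCHEME = "dbfs:"   # scheme used in Spark-style URIs
LOCAL_ROOT = "/dbfs"    # assumed local mount point on the cluster


def dbfs_uri_to_local_path(uri: str) -> str:
    """Translate a dbfs:/ URI into the /dbfs/... form used by local file APIs."""
    if not uri.startswith(DBFS_SCHEME + "/"):
        raise ValueError(f"not a DBFS URI: {uri!r}")
    # Drop the scheme and leading slashes so the remainder is a relative path.
    relative = uri[len(DBFS_SCHEME):].lstrip("/")
    return str(PurePosixPath(LOCAL_ROOT) / relative)


# Hypothetical mounted location, for illustration only.
print(dbfs_uri_to_local_path("dbfs:/mnt/raw/events/2024.json"))
# prints /dbfs/mnt/raw/events/2024.json
```

The point is only that DBFS exposes object storage through file-system-style paths; the actual translation to cloud storage API calls happens inside Databricks.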

    Volumes (Unity Catalog):

    • Volumes are Unity Catalog objects representing a logical volume of storage in a cloud object storage location.
    • They provide capabilities for accessing, storing, governing, and organizing files.
    • While tables provide governance over tabular datasets, volumes add governance over non-tabular datasets.
    • A volume can be either managed or external.
    • The path to access files in volumes uses the following format: /Volumes/<catalog_identifier>/<schema_identifier>/<volume_identifier>/<path>/<file_name>.
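
The path format in the last bullet can be sketched as a small helper. This is an illustration only: the function name `volume_path` and the catalog, schema, and volume names are hypothetical, and the helper only builds the string without checking that the volume exists.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build /Volumes/<catalog>/<schema>/<volume>/<path>/<file_name>."""
    for name in (catalog, schema, volume):
        if not name or "/" in name:
            raise ValueError(f"invalid identifier: {name!r}")
    # Strip slashes so a stray leading "/" cannot reset the path to the root.
    cleaned = [part.strip("/") for part in parts]
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *cleaned))


# Hypothetical names, for illustration only.
print(volume_path("main", "landing", "raw_files", "images", "cat.png"))
# prints /Volumes/main/landing/raw_files/images/cat.png
```

On compute that has Unity Catalog enabled, a string like this would typically be passed to whatever reads the file; access is then decided by the privileges granted on the volume rather than by a mount.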

    Do volumes belong to DBFS, or where exactly do they sit in terms of architecture?

    DBFS and volumes are separate components.

    • DBFS is a distributed file system for interacting with cloud object storage.
    • Volumes are Unity Catalog objects for accessing, storing, governing, and organizing files in cloud object storage.

    They do not belong to each other but interact with the same cloud object storage in different ways.
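
One rough way to see the separation is to look at which prefix a path uses. The sketch below is a hypothetical classifier, not a Databricks API; it assumes that `/Volumes/...` paths (optionally written with a leading `dbfs:` scheme) refer to Unity Catalog volumes and that other `dbfs:/` or `/dbfs/` paths refer to DBFS.

```python
def classify_path(path: str) -> str:
    """Return 'volume', 'dbfs', or 'other' based on the path prefix alone."""
    # Assumption: the dbfs: scheme may appear in front of a /Volumes/ path.
    bare = path[len("dbfs:"):] if path.startswith("dbfs:") else path
    if bare.startswith("/Volumes/"):
        return "volume"
    if path.startswith(("dbfs:/", "/dbfs/")):
        return "dbfs"
    return "other"


# Hypothetical paths, for illustration only.
for p in (
    "/Volumes/main/default/files/a.csv",
    "dbfs:/mnt/raw/a.csv",
    "abfss://container@account.dfs.core.windows.net/a.csv",
):
    print(p, "->", classify_path(p))
```

Both kinds of path can end up pointing at the same cloud object storage; what differs is the layer that resolves the path and enforces access.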

    I want to understand the position of volumes as well as their advantages compared to DBFS.

    As mentioned, they serve different purposes.

    • Volumes provide governance over non-tabular datasets through Unity Catalog, and a volume can be either managed or external.
    • DBFS is a file system abstraction for interacting with cloud storage; it maps paths and mounts onto object storage. Neither is part of the other, but both work against the same cloud object storage.