Tags: azure, azure-storage, azure-data-lake, azure-databricks

Azure Data Lake Gen2 vs Storage account


I need to process some big data and plan to deploy a Databricks cluster together with a storage technology. I am currently evaluating Data Lake Gen2, which supports both object and file storage. A storage account (blob, file, table, queue) has similar capabilities and can also handle both file-based and object-based storage requirements. Because of these similarities, I am unsure which option to choose. Can someone clarify the following questions, please?

  1. Apart from HDFS support, what other significant features would justify choosing Data Lake Gen2 over a Storage Account?
  2. Is a Storage Account v2 with hierarchical namespace enabled the same thing as Data Lake Gen2? If so, can I use its File System to create file shares and mount them in my VM, as I can with a Storage Account's file shares?
  3. For accessing data from Databricks, which of the two is better for big data workloads? I can see that a Storage Account can also be mounted as DBFS, which can still leverage distributed processing.

Solution

  • Apart from HDFS support, what other significant features would justify choosing Data Lake Gen2 over a Storage Account?

    Answer: There are other benefits as well. In short, they fall under performance, management, security, and cost. For more details, you can refer to this official article.
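One concrete source of the performance and management benefits is the hierarchical namespace. In a flat blob namespace a "directory" is only a name prefix, so renaming it means touching every blob under that prefix, whereas with a hierarchical namespace the directory is a real object and a rename is a single metadata operation. The toy model below illustrates the difference in plain Python. It does not call any Azure API, and the function names are made up for this illustration.

```python
def rename_dir_flat(blobs: dict, old: str, new: str) -> int:
    """Flat namespace: rewrite every key under the old prefix. Returns objects touched."""
    touched = 0
    for key in [k for k in blobs if k.startswith(old + "/")]:
        blobs[new + key[len(old):]] = blobs.pop(key)
        touched += 1
    return touched


def rename_dir_hierarchical(tree: dict, old: str, new: str) -> int:
    """Hierarchical namespace: the directory is one node, so one entry is re-linked."""
    tree[new] = tree.pop(old)
    return 1


# Same three files, modelled as flat keys and as a nested directory tree.
flat = {"logs/2020/a.csv": b"1", "logs/2020/b.csv": b"2", "raw/x.csv": b"3"}
tree = {"logs": {"2020": {"a.csv": b"1", "b.csv": b"2"}}, "raw": {"x.csv": b"3"}}

print(rename_dir_flat(flat, "logs", "archive"))          # 2
print(rename_dir_hierarchical(tree, "logs", "archive"))  # 1
```

The flat rename grows with the number of files under the directory, which matters for big data jobs that write to a temporary folder and rename it on commit.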

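  • Is a Storage Account v2 with hierarchical namespace enabled the same thing as Data Lake Gen2? If so, can I use its File System to create file shares and mount them in my VM, as I can with a Storage Account's file shares?

    Answer: Yes, ADLS Gen2 data can be mounted on a VM, as Blob storage can. Note that a Gen2 "file system" is a blob container with a hierarchical namespace rather than an Azure Files (SMB) share, so it is typically mounted with a tool such as BlobFuse rather than as an SMB file share.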

  • For accessing data from Databricks, which of the two is better for big data workloads? I can see that a Storage Account can also be mounted as DBFS, which can still leverage distributed processing.

    Answer: ADLS Gen2 can also be mounted as DBFS. Given the benefits listed in Answer 1, ADLS Gen2 should be the better choice.
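As a rough sketch of what mounting ADLS Gen2 as DBFS can look like from a Databricks notebook, the helpers below build the `abfss://` source URI and the OAuth (service principal) settings, then pass them to `dbutils.fs.mount`. The helper names, the container, account and mount names, and the choice of service-principal authentication are assumptions made for illustration. `dbutils` exists only inside a Databricks notebook, so it is passed in as a parameter here.

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Build the abfss:// URI that addresses an ADLS Gen2 container (file system)."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def oauth_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """ABFS driver settings for service-principal (OAuth) authentication."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_adls(dbutils, container: str, account: str, mount_name: str, configs: dict) -> str:
    """Mount the container under /mnt/<mount_name> using the notebook's dbutils object."""
    mount_point = f"/mnt/{mount_name}"
    dbutils.fs.mount(
        source=abfss_uri(container, account),
        mount_point=mount_point,
        extra_configs=configs,
    )
    return mount_point
```

In a notebook this might be called as `mount_adls(dbutils, "raw", "mylake", "raw", oauth_configs(app_id, secret, tenant_id))`, with the secret read from a secret scope rather than hard-coded. After that, Spark can read paths such as `/mnt/raw/...` in the same way as any other DBFS path.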