azure, azure-data-factory, azure-databricks, azure-data-lake, azure-purview

Handling partitioned data in Azure?


I have some containers in ADLS (Gen2), each with multiple folders. I would like a mechanism that scans those folders, infers their schema, detects partitions, and updates a data catalog accordingly. How do I achieve this functionality in Azure?

Sample:

- container1
---table1-folder
-----10-12-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-13-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-14-1970
-------files1.parquet
-------files2.parquet
---table2-folder
-----zipcode1
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----zipcode2
-------files1.parquet
-------files2.parquet

...

So, what I expect is that the catalog will contain two tables (table1 & table2), where table1 has date-based partitions (3 dates in this case) along with the underlying data, and table2 likewise has two partitions and their underlying data.

In the AWS world, I can run a Glue crawler that crawls these files, infers schemas and partitions, and populates the Glue Data Catalog; later I can query the data through Athena. What's the equivalent approach in Azure to achieve something similar?


Solution

  • I would recommend looking at Azure Synapse Analytics serverless SQL pools. You can create a view over these folders that supports partition elimination if you follow this approach:

    -- If you do not have a Master Key on your DW you will need to create one
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>' ;
    
    GO
    
    CREATE DATABASE SCOPED CREDENTIAL msi_cred 
    WITH IDENTITY = 'Managed Service Identity' ;
    
    GO
    
    CREATE EXTERNAL DATA SOURCE ds_container1
    WITH 
      ( TYPE = HADOOP , 
        LOCATION = 'abfss://container1@mystorageaccount.dfs.core.windows.net' , 
        CREDENTIAL = msi_cred
      ) ;
    
    GO
    
    CREATE VIEW Table2
    AS SELECT *, f.filepath(1) AS [zipcode]
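    -- filepath(1) returns the part of the path matched by the first * wildcard in the BULK path,
    -- i.e. the partition folder name (here the zipcode)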
    FROM
        OPENROWSET(
            BULK 'table2-folder/*/*.parquet',
            DATA_SOURCE = 'ds_container1',
            FORMAT = 'PARQUET'
        ) AS f ;
    
    GO
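
    A view over table1-folder can follow the same pattern, exposing the date folder as a column; the view and column names below (Table1, [file_date]) are only illustrative. Filtering on that column is what lets serverless SQL skip the non-matching date folders, i.e. the partition elimination mentioned above:

    CREATE VIEW Table1
    AS SELECT *, f.filepath(1) AS [file_date]
    FROM
        OPENROWSET(
            BULK 'table1-folder/*/*.parquet',
            DATA_SOURCE = 'ds_container1',
            FORMAT = 'PARQUET'
        ) AS f ;
    
    GO
    
    -- This should only read files under the 10-12-1970 folder; the other date folders are skipped
    SELECT COUNT(*) FROM Table1 WHERE [file_date] = '10-12-1970' ;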
    

    Then set up Azure Purview as your data catalog and have it index your Synapse serverless SQL pool.