azure, azure-data-factory, azure-databricks, azure-data-lake, azure-purview

Handling partitioned data in Azure?


I have some containers in ADLS (Gen2), each with multiple folders. I would like a mechanism that scans those folders, infers their schema, detects partitions, and updates a data catalog accordingly. How do I achieve this functionality in Azure?

Sample:

- container1
---table1-folder
-----10-12-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-13-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-14-1970
-------files1.parquet
-------files2.parquet
---table2-folder
-----zipcode1
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----zipcode2
-------files1.parquet
-------files2.parquet

...

So, what I expect is that the catalog will contain two tables (table1 & table2), where table1 has date-based partitions (3 dates in this case) along with the underlying data, and table2 likewise has two partitions and their underlying data.

In the AWS world, I can run a Glue crawler that crawls these files, infers schemas and partitions, and populates the Glue Data Catalog; later I can query the data through Athena. What's the equivalent approach in Azure to achieve something similar?


Solution

  • I would recommend looking at Azure Synapse Analytics serverless SQL pools. You can create a view over these folders that supports partition elimination if you follow this approach:

    -- If you do not have a Master Key on your DW you will need to create one
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>' ;
    
    GO
    
    CREATE DATABASE SCOPED CREDENTIAL msi_cred 
    WITH IDENTITY = 'Managed Service Identity' ;
    
    GO
    
    CREATE EXTERNAL DATA SOURCE ds_container1
    WITH 
      ( TYPE = HADOOP , 
        LOCATION = 'abfss://container1@mystorageaccount.dfs.core.windows.net' , 
        CREDENTIAL = msi_cred
      ) ;
    
    GO
    
    CREATE VIEW Table2
    AS SELECT *, f.filepath(1) AS [zipcode]
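    -- filepath(1) returns the part of the path matched by the first * wildcard in the BULK path,
    -- i.e. the partition folder name (here the zipcode)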
    FROM
        OPENROWSET(
            BULK 'table2-folder/*/*.parquet',
            DATA_SOURCE = 'ds_container1',
            FORMAT = 'PARQUET'
        ) AS f ;
    
    GO
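
    A view over table1-folder can follow the same pattern, exposing the date folder as a column; the view and column names below (Table1, [file_date]) are only illustrative. Filtering on that column is what lets serverless SQL skip the non-matching date folders, i.e. the partition elimination mentioned above:

    CREATE VIEW Table1
    AS SELECT *, f.filepath(1) AS [file_date]
    FROM
        OPENROWSET(
            BULK 'table1-folder/*/*.parquet',
            DATA_SOURCE = 'ds_container1',
            FORMAT = 'PARQUET'
        ) AS f ;
    
    GO
    
    -- This should only read files under the 10-12-1970 folder; the other date folders are skipped
    SELECT COUNT(*) FROM Table1 WHERE [file_date] = '10-12-1970' ;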
    

    Then set up Azure Purview as your data catalog and have it index your Synapse serverless SQL pool.