I have some containers in ADLS (gen2) and have multiple folders within that container. I would like to have a mechanism to scan those folders to infer their schema and detect partitions and update them in the data catalog. How do I achieve this functionality in Azure?
Sample:
- container1
---table1-folder
-----10-12-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-13-1970
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----10-14-1970
-------files1.parquet
-------files2.parquet
----table2-folder
-----zipcode1
-------files1.parquet
-------files2.parquet
-------files3.parquet
-----zipcode2
-------files1.parquet
-------files2.parquet
...
So, what I expect is that in the catalog, it will create two tables (table1 & table2) where table1 will have date-based partitions (3 dates for this case) and have underline data within that table. Same for table2 which will have two partitions and their underline data.
In the AWS world, I can run a Glue crawler that can crawl these files, infers schemas and partitions, and populate Glue data catalogs, later I can query them through Athena. What's the Azure equivalent approach to achieve something similar?
I would recommend looking at Azure Synapse Analytics Serverless SQL. You can create a view which consumes the folders and does partition elimination if you follow this approach:
-- If you do not have a Master Key on your DW you will need to create one
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>' ;
GO
CREATE DATABASE SCOPED CREDENTIAL msi_cred
WITH IDENTITY = 'Managed Service Identity' ;
GO
CREATE EXTERNAL DATA SOURCE ds_container1
WITH
( TYPE = HADOOP ,
LOCATION = 'abfss://container1@mystorageaccount.dfs.core.windows.net' ,
CREDENTIAL = msi_cred
) ;
GO
CREATE VIEW Table2
AS SELECT *, f.filepath(1) AS [zipcode]
FROM
OPENROWSET(
BULK 'table2-folder/*/*.parquet',
DATA_SOURCE = 'ds_container1',
FORMAT='PARQUET'
) AS f
Then setup Azure Purview as your data catalog and have it index your Synapse Serverless SQL pool.