Tags: azure, powerbi, databricks, azure-analysis-services

Load data from Databricks to Azure Analysis Services (AAS)


Objective

I'm storing data in Delta Lake format on ADLS gen2, and the tables are also available through the Hive catalog. It's important to note that we're currently using Power BI, but in the future we may switch to Excel over AAS.
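For context, the tables are produced by a fairly standard Databricks/Delta write, roughly like the sketch below (the table names and storage paths are illustrative only, not our real ones):

```python
from pyspark.sql import SparkSession

# In a Databricks notebook the SparkSession already exists; this just grabs it.
spark = SparkSession.builder.getOrCreate()

# Hypothetical source table and target path, for illustration only.
df = spark.read.table("staging.events_raw")
delta_path = "abfss://lake@mystorageaccount.dfs.core.windows.net/curated/events"

# Write the data to ADLS gen2 in Delta Lake format.
(df.write
   .format("delta")
   .mode("overwrite")
   .save(delta_path))

# Register the Delta location in the Hive catalog so it can be queried by name.
spark.sql("CREATE DATABASE IF NOT EXISTS curated")
spark.sql(f"CREATE TABLE IF NOT EXISTS curated.events USING DELTA LOCATION '{delta_path}'")
```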

Question

What is the best way (or hack) to connect AAS to my ADLS gen2 data in Delta Lake format?

The issue

There is no Databricks/Hive connector among the AAS supported sources. AAS supports ADLS gen2 through the Blob connector, but AFAIK it doesn't support the Delta Lake format, only Parquet.

Possible solution

From this article I see that the issue could potentially be solved with the Power BI on-premises data gateway:

One example is the integration between Azure Analysis Services (AAS) and Databricks; Power BI has a native connector to Databricks, but this connector hasn’t yet made it to AAS. To compensate for this, we had to deploy a Virtual Machine with the Power BI Data Gateway and install Spark drivers in order to make the connection to Databricks from AAS. This wasn’t a show stopper, but we’ll be happy when AAS has a more native Databricks connection.

The issue with this solution is that we're planning to stop using Power BI. I also don't quite understand how it works: what Power BI license it requires and what implementation/maintenance effort it involves. Could you please provide deeper insight into how it would work?

UPD, 26 Dec 2020

Now that Azure Synapse Analytics is GA, it fully supports serverless SQL on-demand. That means serverless Synapse could theoretically be used as glue between AAS and Delta Lake. See "Direct Query Databricks' Delta Lake from Azure Synapse".

At the same time, is it possible to query the Databricks catalog (internal/external tables) from Synapse SQL on-demand using ODBC? Synapse supports ODBC as an external source.
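For illustration, this is roughly how such a serverless query could be issued from Python over ODBC. The server name, database, credentials and storage path are placeholders, and it assumes the serverless pool can read Delta natively via OPENROWSET (FORMAT = 'DELTA'), a capability that arrived after this update was written:

```python
import pyodbc

# Connect to the Synapse serverless (on-demand) SQL endpoint; all values are placeholders.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"  # assumed serverless endpoint
    "Database=lakehouse;"                                # assumed user database in the serverless pool
    "UID=loader;PWD=<secret>;Encrypt=yes;"
)

# Ad-hoc query over the Delta folder in ADLS gen2, no copy of the data required.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/lake/curated/events/',
    FORMAT = 'DELTA'
) AS events;
"""

for row in conn.cursor().execute(query):
    print(row)
```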


Solution

  • Power BI Dataflows now supports Parquet files, so you can load from those files into Power BI; however, the standard design pattern is to load the files into Azure SQL Data Warehouse and then layer Azure Analysis Services (AAS) over that. AAS does not support Parquet directly, so you would have to create a CSV version of the final table or load it into a SQL database.

    As mentioned, the typical architecture is to have Databricks do some or all of the ETL, and then have Azure SQL DW sit over it.

    (Architecture diagram: Databricks performing the ETL and loading into Azure SQL DW / Synapse, with AAS layered on top.)

    Azure SQL DW has now morphed into Azure Synapse, which has the benefit that a Databricks/Spark database gets a shadow copy accessible through the SQL on Demand functionality. SQL on Demand doesn't require you to have an instance of the data warehouse component of Azure Synapse; it runs on demand, and you pay per TB of data processed by your queries. A good outline of how it can help is here. The other option is to have Azure Synapse load the data from an external table into that service and then connect AAS to that; a rough example of that load step is sketched below.
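As a rough illustration of that load step, run from a Databricks notebook (the JDBC URL, staging path and table names are placeholders, not a definitive setup), the Databricks-to-Synapse connector can push the curated table into the SQL DW / dedicated pool so AAS connects to a SQL source it supports:

```python
# Read the curated Databricks table (hypothetical name) and write it to Azure Synapse / SQL DW.
df = spark.read.table("curated.events")

(df.write
   .format("com.databricks.spark.sqldw")                     # Databricks-to-Synapse connector
   .option("url", "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;"
                  "database=mydw;user=loader;password=<secret>;encrypt=true")
   .option("forwardSparkAzureStorageCredentials", "true")    # reuse the notebook's storage credentials
   .option("dbTable", "dbo.events")                          # target table in the dedicated pool
   .option("tempDir", "abfss://staging@mystorageaccount.dfs.core.windows.net/synapse-tempdir")
   .mode("overwrite")
   .save())
```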