Search code examples
powerbidatabricksazure-purviewdata-lineage

How can you create lineage between Power BI datasets and Databricks sql warehouse


My organization wants to have a complete overview of the lineage from Power BI reports to data warehouse (lakehouse architecture). The goal for now is to create a PoC using Purview.

My question is, how can I link Power BI assets to the specific tables used with Databricks SQL endpoint's as a source?

As far as I know, there are no out-of-the-box connectors available or even open-source custom components. The problem I have is that using the Power BI API, you can only get the reference to the SQL endpoint (which of course makes sense). When Power BI datasets refresh, it uses SQL queries to retrieve data. I thought I might parse those queries from the query history in Databricks, however, I cannot find a way to determine which dataset refresh triggered the query. As a result, I cannot link the tables used by the query to the dataset.

Another thought I had was to parse .pbix files to identify the tables that are being used, but I have no experience with this. Any guidance on how to accomplish the lineage would be appreciated.


Solution

  • For everyone coming across this post trying to find a solution. There is a Power BI feature in preview (at time of writing: September 2023), Power BI project files. Using these project files which are now in readable format (model.bim) I am able to parse the file to get the used tables from a dataset and using this information to create lineage between Databricks assets & Power BI assets in Purview using the API.