Search code examples
powerbidaxpowerquerypowerbi-desktopdata-modeling

power query design for performance of queries (using SOAP connector) for Power BI/Excel


I need to manage a large dataset of approx 100.000 records per year over many years.

The data is retrieved with SOAP connectors using custom query's.

Each records has 'basic info' and 'historic info' which need to be extracted with different SOAP connectors.

In the future, the records will need to be retrieved from other DB/source/API for new records/years.

Retrieving 100.000 records of basic info over SOAP takes about 2minutes. Including the historic info about 20-30 minutes (each records has an average of 10 'hitoric records' as 'child')

Only the 'present' period of data needs to be updated regularly. Old years in fact 'never' need to be updated, or just on sporadic explicit demand, eg, when adding a new attribute.

So I tried the following design in order to prevent excessive update time:

  • I create a function getDataBySOAP(start,end,includeHistory). This does all the work (with different SOAP requests etc)
  • Then 1 query for each year of data, lets say 'data22', data23, data24 etc, just calling the SOAP function (so in the fucture data for another period could be retrieved from another function/DB). All query's were set to never update automatically (all checks off), excepts for the 'current' query
  • Then 1 query, say allData, appendin all year data together

But now, when I say 'update' on the allData, then the update updates everyting, including history. So then we're set of to .. have MANY coffees before the update is ready, because all queries are updated.

What is going wrong? Should I design the queries differently? Now the hostoric data is in additional optional columns. but maybe I sould seperate then in seperate queries .. but even then , it seems like an update would trigger a FULL update, taking hours.

Note: I an rather expert in Excel/VBA. But this is my first Power query/Power BI project .. so really a novice

Thanks for your tips. C.


Solution

  • That's just how Power Query works. If you have a master query that references the other queries, then they will all be re-evaluated each refresh. In your case, why not have the master query as a calculated table in DAX. That way, the historic tables will be stored and not updated (also mark these as hidden as they won't be used apart from in a calculated table), the new table can be set to update and the calculated table will append them all together. A drawback is the additional storage required (you will be storing historic data twice) but this will be offset with a faster refresh time.

    Another option is incremental refresh is you want to configure this properly: https://learn.microsoft.com/en-us/power-bi/connect-data/incremental-refresh-overview