Search code examples
powerbipowerquerypowerbi-desktopdata-cleaningdata-modeling

What is best practice to deal with missing data according to Kimball?


I have a data base with the following tables:

Customers, Invoices, Salesman, Target.

The ones concerned about my question are Customers, Invoices.

There are customersIDs used in the Invoices but doesn't exist in the Customers table.

If I used only the customers from Customers Table, my customer dimension would be incomplete.

My solution is to append these IDs from Invoices to Customers and fill other columns in the Customers table with nulls.

I don't know if this is the best approche according to Kimball?

also, if it is a good solution, how can I add accomplish it with Power bi desktop?

Customers table: "generated Data"

enter image description here

Invoice table:

enter image description here

..... just a sample the table is thousands of rows.


Solution

  • There's two points here:

    Firstly, (in import mode at least) PBI already creates the "blank row" for items present in your fact table but missing from your dimension table for precisely this scenario. If you don't need the granularity of each individual missing customer id, then you don't need to do anything.

    Secondly, if you need to to retain that granularity then your approach is the correct one. The way to do this in Power Query is as follows:

    1. Create a new query which takes your customer dimension table and does a left outer join on customer id with your invoice fact table.
    2. Expand the newly joined table but retain only the new customer id column.
    3. Remove all columns apart from the new customer id column.
    4. Remove duplicates
    5. You now have a list of missing customer ids. Ensure the column name is the same as the column name of you customer id in the customer dimension table. Append this to the original customer dimension query and the nulls will be filled in automatically for the missing columns.