database database-design data-modeling data-warehouse

Null Foreign Keys in Fact Table

The data source I'm using for the fact table may include missing Data (NULL). For example for an employee, some "Department names" are missing and I already have "Dim_Department" in my model.

I've read on some websites that there should not be NULL Foreign Keys according to best practices. Here what I'm thinking:

1) Add a row in the dimension tables like this:

Department ID / Deparmtent CODE / Department Name

0                NA                   Non Available

2) I think I should edit my source file for null columns should be replaced with "NA" so when I join with "Department CODE" I retrieve the Department ID in the fact table

Is my approach correct because I don't want to lose the rows from my fact table ?

Solution

What is interesting in your scenario is that a Department code of 'NA' is not a fact at all, there is no physical department with that name. This simple scenario can induce downstream issues with your ETL or application processing. There is in fact no department. The concept of NA here makes it easier for humans to process the final result, but adds a layer of complexity that most of the analysis needs to exclude records linked to the NA department. We don't want to see the NA department win any awards for performance or customer satisfaction... Instead of injecting the labels NA or Not Available into the data itself, this type of label formatting should be applied in the reporting layer and can easily be applied in the final projection if NULLs are used.

NULL is an incredibly powerful tool for indicating that FK relationships do not exist. The general recommendation to avoid them is based around the assumption that your DBAs or developers querying the DB/DW will not be expecting nulls and will not design their queries to support nulls.

The way NULL FKs work and how to exploit them best are a harder concept to teach to beginners and as such it can easily cause unexpected results if you do not understand how to work with them, but that is not a valid excuse to say we should not be using them.

The problem with assigning an arbitrary number or value to the FK that represents a missing related entity is that you now need to either remove the FK constraint or you need to create a matching arbitrary record in the principal table. Then in all of your queries or analysis you need to account for these arbitrary values and decide when to exclude or include them in your results.

Although mapping missing records to a zero record will work, if you do not exclude theses results from queries you could easy bring back many dimensions that you should not be processing, because they represent fake or arbitrary entities.

If every query and join statement requires you to append a criteria like AND ParentTable_Id <> 0 then you are actually introducing more technical debt into your solution, new DBAs and developers will need to remember when and where to include these arbitrary criteria in every query. When using nullable FKs the schema itself documents this relationship constraint, we do not have to consult specific records or rely on arbitrary conventions.

Many functions and joins will ignore a NULL FK and in most cases this would be both correct and expected for the query. My experience is that the number of exceptional cases where we do want to include the nulls is significantly less than the times where we would like to exclude the nulls.

I would strongly argue that debugging dataset or analysis queries that are incorrect due to including or excluding related entities incorrectly takes the same effort, except that the NULLs ability to automatically exclude records from joins or aggregates is generally more convenient and the expected outcome, so when using NULL you end up debugging a whole lot less.

Due to this being a very long running contentious argument since its inception in the 70s, if your database does use nullable FKs, it is still wise to broadcast this as a deliberate convention.

There is of course 1 major caveat to this discussion. If your DW needs to support populating tables with dimensions independently and potentially out of order, then you may find that arbitrary records will help you identify between records that are genuinely undefined, as opposed to invalid, Expected or simply Missing from the DW but have a real value in the source/external system. Review your ETL processes to understand how using or not using null FKs will impact your data flow.