Search code examples
sql-serverssisdata-warehousebusiness-intelligence

Data Warehouse design - Handling NULL and empty values in the OLTP


I am creating a DW for an OLTP that is creaking somewhat.

A problem I'm faced with is that there isn't much data integrity in the OLTP database. An example would be a Suburb field.

This suburb field is a free text field on the OLTP UI which means we've got values in the field, plus we've got empty strings and we've got NULL values.

How would we usually handle this? The scenarios I've come up with are:

  1. Import data as is (not ideal)
  2. In my ETL process, treat any empty string the same as a NULL and replace that with the word 'Unknown' in the DW
  3. Import both empty strings and NULL's as empty strings in the DW

Just FYI, I'm using the Microsoft BI stack (SQL Server, SSIS, SSAS, SSRS)


Solution

  • The short answer is, it depends on what NULL and empty strings mean in the source system.

    This general question (handling NULL) has been discussed a lot, e.g. here, here, here etc. I think the most important point to remember is that a data warehouse is just a database; it may have a very specific type of schema and be designed for one purpose, but it's still just a database and any general advice on NULL still applies.

    (As a side note, I sometimes prefer to talk about a "reporting database" rather than a "data warehouse", because it keeps things in perspective. Some DBAs and developers start making plans for huge server farms and multi-year ETL projects as soon as they hear the words "data warehouse", but in the end it's just a reporting database.)

    Anyway, it isn't completely clear where you want to use NULL but it looks like it may be an attribute on a dimension.

    I (probably) wouldn't use any of your three approaches, but it depends on the meaning of your data. Importing the data as-is is not useful because part of the value of a data warehouse is that the data has been cleaned and is consistent, which makes querying and comparing data along other dimensions much easier.

    Replacing empty strings with 'Unknown' may or may not be correct: what does an empty string mean in the source system? There's a big difference between "it means there's no suburb" and "it means we don't know if there's a suburb". Assuming that an empty string means "no suburb" and NULL means "unknown" then I would import the empty strings as they are, but replace NULL with 'Unknown'. The main reason for doing that is that if the Suburb field will be used as a filter condition in a report, it's easier for users (and possibly your reporting tool) to work with a non-NULL value like 'UNKNOWN'. And if there is no consistency in the source system and you don't know what empty strings and NULLs mean, then you need to clarify that first and ideally fix the source system too (another benefit of a DWH is that it helps to identify inconsistencies and data handling errors in source systems).

    Your last idea to convert NULLs to empty strings is the same issue: what does a NULL actually mean in the source system? If it means "no suburb" then replacing it with an empty string is probably a good idea, but if it means something else then you should handle it as something else.

    So to summarize, my preference would be to import empty strings as-is, and convert NULL to 'UNKNOWN', but I can't be sure that this actually makes sense in your case. There's no single answer to this question because it all depends on your specific data and what it means. But there's no problem with using NULL in a data warehouse (or any other database) as long as you do it consistently and with a clear understanding of how the source systems handle data.