Search code examples
data-warehouse

Querying a data warehouse based on a varible date


I have a dimension table that tracks changes to an object similar to an employee object. However the employee has a "status" (Pending, active, inactive etc.) and start and end dates for each status which is tracked in another table in the source db.

There is no limit to how many times the status can update. It can go back and forth between statuses.

How do I model this if I wish to create a report such as employees with a given status on a specific day. If I use the status changes as the fact, I will only have the additional information from employee for dates the status changed on.

Do I need to create a fact table that updates daily with the current employee records and status?

Or can I create a fact table with effective from and to dates like in the dimension table?

Or am I way off on this and should be approaching it differently?


Solution

  • The first thing you must decide is whether you will poll the employee or you get an event feed of the employee status change.

    The first option leads to simplified design, you load periodically (say once per day) the employee status of all employees and build your dimension.

    Note that this is a aproximation, because the employee can swicht the status several times a day, but you consider only one status per day.

    The tables is as follows

     employee_id, 
     validfrom_date, 
     validto_date, 
     status
    

    The validfrom_date is the extraction date, validto_date is calculated. You discard all employees that do no change in the status.

    The second option is more involved, but produce more exact results.

    You load all status changes of all employees from the source system with exact timestamp, so it is possible to cover more changes per day.

    Possible interface is:

     employee_id, change_timestamp, old_status, new_status
    

    Note that the old_status is redundant and the value may be used to check if the interface is consistent.

    The final table is similar to the previous one, only timestamps are used instead of dates.

     employee_id, 
     validfrom_timestamp, 
     validto_timestamp, 
     status
    

    Again the validfrom_timestamp is the timestamp from the event from the interface, the validto_timestamp is calulated.

    In this setup a periodical check of the consistence of your dimension is recommended.

    The problem - if some change event get lost, you never recover it. Through the time you may accumulate such errors. So say once per month you check, if your actual status of the dimension matches the status in the source system. If not you fix the difference.

    Finally do not speculate if this is fact or dimension table. In the Kimball's model there is no strict distinction between them.

    Take an Employee table. If you reports the employee status the role is fact table. If you use it to join to other fact tables the role is dimension.