Search code examples
ssasdata-modelingdata-warehousedimensionfact

How to model process and status history in a data warehouse?


Let's say that we have D_PROCESS, D_WORKER and D_STATUS as dimensions, and the fact F_EVENT that links a process (what) with a worker (who's in charge) and the "current" status.

The process status changes over time. Shoud we store in F_EVENT one line per process/status/worker, or one line per process/worker, and "somewhere else" one line per status change for a given process/worker?

I'm new to Datawarehouse and it's hard to find best practices/tutorial related to data modelization.


Solution

  • Read The Data Warehouse Toolkit by Ralph Kimball for a good introduction to dimensional modeling.

    It sounds like you are storing a process change event in F_EVENT. If this process has a defined beginning and end, I would build a snapshot fact table which would let you track the process over time (simply updating the row each time the process moves from one step to another).

    EDIT:

    I'll try to make a general case using your dimensions as examples.

    For D_PROCESS, modeling a "process" isn't usually modeled as a dimension, and you called it a "what", so I'm going to rename this to "D_ACCOUNT".

    The basic data model will be for a "tax processing system" in which WORKERS are processing ACCOUNTS, and each ACCOUNT/WORKER combination has several possible "STATUSES" of where this process currently stands.

    D_ACCOUNT
        ACCOUNT_NUMBER
        ACCOUNT_TYPE
    
    D_WORKER
        WORKER_ID
        FIRST_NAME
        LAST_NAME
        BADGE_NUMBER
        SHIFT
    
    D_STATUS
        STATUS_ID
        STATUS_NAME
    

    Now if I want to report on all "events" that have happened to an Account, performed by a worker, I can build a Transaction-level fact table F_EVENT:

    F_EVENT
        ACCOUNT_ID
        WORKER_ID
        STATUS_ID
        EVENT_TIME_ID
        Metrics taken at time of the measurement (Cost, Worker time spent, etc)
    

    We call the unique combination of dimensions that identifies a row the Granularity or Grain of the fact table.

    The grain of this table is Account, Worker, Status, and Time. It answer questions like "How much time did my workers on shift 3 spend processing accounts on Wednesday?" or "How many events occured that changed the processing status to "CLOSED"?

    I'm not sure how much this type of table would help.

    Instead, say you are interested in tracking the process itself as it moves through various statuses. I'm going to assume that the status always moves forward in time, from "NOT STARTED" to "IN PROCESS" to "CLOSED".

    I'll build what Kimball calls an "Accumulating Snapshot Fact table.

    F_TAXPROCESSING
        ACCOUNT_ID
        WORKER_ID 
        CURRENT_STATUS_ID
        NOT_STARTED_DTTM
        NOT_STARTED_FLAG
        IN_PROCESS_DTTM
        IN_PROCESS_FLAG
        CLOSED_DTTM
        CLOSED_FLAG
    

    This table's grain is Account, Worker. This table keeps track of the "process" by updating the date/time of the change to the status, and a flag when that status has been reached.

    This allows you to track the process over time, allowing you to see how many accounts have reacted the "IN PROCESS" status, how long it took to get there, et cetera.