Search code examples
sql-serverdatabasereporting-servicesssasdata-warehouse

Custom/dynamic categorizations in data warehouse


We are in the process of expanding our data warehouse and we've come to a conundrum that nobody on the team has been able to solve. I tried searching for answers on Google, but it's hard to get the right terms.

Essentially, the problem is this. We have a data warehouse that keeps track of items. We have to do a lot of reporting out of this data warehouse. A lot of reports are driven by categorizing these items and then aggregating data according to the categorizations. Some reports share categorizations, and some may have their own that are specific to the report. Even in that case, there may be subsequent reporting or analysis (ad-hoc) that will need to reuse these categorizations. And there will likely be a lot of these categorization schemes. Furthermore, these schemes may rely on data from more than one fact table in order to categorize an item (and item information spans multiple fact tables due to different grain).

Our initial plan was to have dimensions for these, one for each categorization, or perhaps a junk dimension with a column for each category. And that would be fine if the number of categorizations is likely to remain static. However, there will always be new reports and analytical needs. We don't want to have to change the schema and the ETL process every time we want to add a new categorization scheme. We might also want to change the logic for a categorization without having to reimport the data or rerun parts of the ETL to recalculate.

So our options seem to be as follows:

  1. Just put the logic in report and analysis queries, even if that means copying and pasting categorization logic from query to query
  2. Use the dimension approach, with its flaws
  3. Have functions for calculating categories, which can be used in queries, but will be costly because they may have to do additional queries inside of them (bad performance)
  4. Have views on top of the fact tables that include extra category columns; the views can more easily be changed than regular table schemas, and don't require modifications to the ETL process
  5. Use some sort of bridge table scheme to implement a many-to-many mapping between categories and items in each category; this increase query complexity, but reduces scheme modifications to none; ETL would still need to be modified, but the modification could probably be in a smaller area (maybe a single query or procedure to update the category mappings)

We want this system to be accessible, such that users won't need to do much, if anything, more to access these categorization fields as they would to access regular fact and dimension fields. We also want to avoid having to make a lot of changes to the schema and ETL process every time a new categorization schema is added to the database.

So my question is, essentially, are there ways better than the five I listed that can be used to solve this problem? Are there variations on those five that more effectively solve the problem? Or is it just a hard problem that will require some degree of suffering? Maybe I'm going about this all wrong, so any feedback in that respect would be helpful too.

tl;dr: we need ways to have a lot of different categorizations of items in the data warehouse and are not sure of the most efficient, easily managed or easy to use system.

EDIT: additional info: we are using SQL Server, with SSRS as the primary reporting front-end, and SSAS as a secondary front-end for analysis or ad-hoc querying.


Solution

  • Based on the discussion in the comments above, your best bet would probably be to build a small little .NET app or web app that allows the users to define categorizations themselves based on dimensional attributes.

    You could define categories in a "Category Table"

    CategoryID  
    Category Name
    

    You would then build a set of "mapping" tables that essential create a snowflake off each dimension to a category:

    Category - Dim1 Mapping (CATEGORY_D1)

    CategoryID
    Dim1ID
    

    Category - DimX Mapping

    CategoryID
    DimXID
    

    Your little app would maintain and build these mapping tables as categories are defined.

    When you build reports then, you define your joins between the Fact Table, Dim Tables, and Category Table.

    If I wanted to find all items in Category 3, then I would write:

    SELECT * 
    FROM ITEMS_FACT F
    JOIN DIM1 D1 ON (F.DIM1_ID = D1.DIM1_ID)
    JOIN CATEGORY_D1 CD2 ON (CD2.DIM1_ID = D1.DIM1_ID 
                             AND CD1.CATEGORY_ID = 3)
    JOIN DIM2 D2 ON (F.DIM2_ID = D1.DIM2_ID)
    JOIN CATEGORY_D2 CD2 ON (CD2.DIM2_ID = D1.DIM2_ID 
                             AND CD2.CATEGORY_ID = 3)
    

    This would allow users to define any category they wanted, and there are no ETL changes at all (unless you have SCD type 2s - you may need to apply categories when the rows are updated).

    If there are categories that manage "exceptions", you may need to build one category mapping table with all the dim combinations on them:

    CategoryID
    Dim1ID
    Dim2ID
    Dim3ID
    

    You would then let the users define the logic in their tool (If Item has Dim1 Attr2 = 'A' and Dim2 Attr3 = 'B' then category 1), and then the mapping table is built based on that.

    Your join would be a bit simpler then - just join the dims & fact to the category mapping on the dimension keys.

    SELECT * 
    FROM ITEMS_FACT F
    JOIN DIM1 D1 ON (F.DIM1_ID = D1.DIM1_ID)
    JOIN DIM2 D2 ON (F.DIM2_ID = D2.DIM2_ID)
    JOIN CATEGORY_MAP CM ON (D1.DIM1_ID = CM.DIM1_ID 
                             AND D2.DIM2_ID = CM.DIM2_ID
                             AND CD1.CATEGORY_ID = 3)