Tags: database, data-integrity, netezza

How to overcome Netezza's lack of unique constraint/referential integrity enforcement?


It appears that the lack of support for enforcing two basic constraints (unique and foreign key) is a cause of many lost man-hours debugging and troubleshooting difficult problems. What starts as a trivial, easy-to-fix problem (duplicate rows / inconsistent primary-detail tables) goes unnoticed, grows, and raises complex edge cases in our application, or even on our hardware (e.g. a self-join with duplicates can cause row inflation and storage exhaustion).

  • Netezza serves multiple purposes in our environment: production, research, QA and staging. Naturally, our ETL processes cannot be equally mature in every environment and cannot verify all constraints in all of those scenarios.
  • Even in our most mature applications used in production, where data is verified while ETL loads it, we create a series of tables, each the result of a calculation on its predecessors. Sometimes data integrity is broken along the way, not right at the start (as a result of a buggy statement).

Can anyone recommend a methodology/tool for avoiding these headaches?


Solution

  • We originally wrote a stored procedure to handle this very thing in our data warehouse.
    While it works in theory, it is rather slow for the huge tables commonly found on NZ (~10 minutes for a 500M-record table).

    I'll explain the stored proc below, but for now I wanted to say that we're not even leveraging the proc at the moment because our incremental/upsert loads are explicitly only inserting records that do not already exist in the target database. (For upsert, we're basically just deleting any records that exist in our set of records to insert before inserting).
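    For context, the delete-then-insert upsert pattern described above looks roughly like the following (table and column names here are hypothetical, not the actual warehouse schema):

    ```sql
    -- Step 1: delete target rows whose business key is about to be re-inserted.
    DELETE FROM target_table t
     WHERE EXISTS (SELECT 1
                     FROM staging_table s
                    WHERE s.id = t.id);

    -- Step 2: insert the full staging set, which now cannot collide
    -- with anything remaining in the target.
    INSERT INTO target_table
    SELECT * FROM staging_table;
    ```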

    This has its own problems, particularly because NZ doesn't like deletes and needs the tables constantly groomed to actually reclaim the space, not to mention the fact that our upserts might lose historical column data by deleting old records that had since been changed (we have other processes for loading slowly changing dimensions that we want to track).
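    Because Netezza deletes are logical, reclaiming that space takes an explicit groom pass. On Netezza 6.0 and later this is roughly (the table name is a placeholder; older releases used the nzreclaim utility instead):

    ```sql
    -- Physically reclaim rows that were logically deleted by the upsert.
    GROOM TABLE target_table RECORDS ALL;
    ```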

    Anyway, the constraint stored proc looks like:

    check_constraints('table','constraint_type') returns boolean
    

    and basically loops through:

      select constraintname, contype, attname
        from _v_relation_keydata
       where relation = upper(p_table) order by constraintname, conseq;
    

    to get the columns that should be compared. For each constraint, it then runs something like:

      select count(*) 
        from ( 
        select distinct col1,col2,...
          from p_table 
         where col1 is not null 
           and col2 is not null... );
    

    and compares that number to

      select count(*) 
        from p_table 
       where col1 is not null 
         and col2 is not null...;
    

    If they're different, we raise an exception.
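    Put together, the core of such a procedure might look like the NZPLSQL sketch below. This is a simplification, not the actual proc: the columns are hard-coded as col1/col2 for brevity, whereas the real procedure builds the column list dynamically from _v_relation_keydata.

    ```sql
    CREATE OR REPLACE PROCEDURE check_constraints(varchar(128), varchar(16))
      RETURNS boolean
      LANGUAGE NZPLSQL
    AS
    BEGIN_PROC
    DECLARE
      p_table    ALIAS FOR $1;
      rec        RECORD;
      v_distinct bigint;
      v_total    bigint;
    BEGIN
      -- Count of distinct key combinations (the real proc assembles
      -- the column list from _v_relation_keydata instead of col1, col2).
      FOR rec IN EXECUTE
        'SELECT count(*) AS c FROM (SELECT DISTINCT col1, col2 FROM '
        || p_table || ' WHERE col1 IS NOT NULL AND col2 IS NOT NULL) x'
      LOOP
        v_distinct := rec.c;
      END LOOP;

      -- Total count of rows with fully non-null key columns.
      FOR rec IN EXECUTE
        'SELECT count(*) AS c FROM ' || p_table
        || ' WHERE col1 IS NOT NULL AND col2 IS NOT NULL'
      LOOP
        v_total := rec.c;
      END LOOP;

      -- Any gap between the two counts means duplicate key values exist.
      IF v_distinct <> v_total THEN
        RAISE EXCEPTION 'Unique constraint violated on %', p_table;
      END IF;
      RETURN true;
    END;
    END_PROC;
    ```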