I'm looking for a good algorithm / method to check the data quality in a data warehouse. Therefore I want to have some algorithm that "knows" the possible structure of the values and then checks if the values are a member of this structure and then decide if they are correct / not correct.
I thought about defining a regexp and the check each value whether it fits or not.
Is this a good way? Are there some good alternatives? (Any research papers?)
I have seen some authors suggest adding a special dimension called a data quality dimension to describe each facttable-record further.
Typical values in a data quality dimension could then be “Normal value,” “Out-of-bounds value,” “Unlikely value,” “Verified value,” “Unverified value,” and “Uncertain value.”