Tags: python, pandas, airflow, standard-deviation, data-quality

Generalized Data Quality Checks on Datasets


I am pulling in a handful of different datasets daily, performing a few simple data quality checks, and then shooting off emails if a dataset fails the checks.

My checks are as simple as checking for duplicates in the dataset, as well as checking that the number of rows and columns in a dataset hasn't changed -- see below.

# The shape should match the expected (rows, columns)
assert df.shape == (1016545, 8)
# There should be no duplicate rows
assert len(df) - len(df.drop_duplicates()) == 0

Since these datasets are updated daily and the number of rows may change, is there a better way to check than hardcoding a specific number?

For instance, one dataset might have only 400 rows, and another might have 2 million. Could I check that today's row count is within, say, one standard deviation of yesterday's? In that case, though, I would need to start collecting previous days' counts in a separate table, and that could get ugly.
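Something like the following is roughly what I have in mind -- assuming a hypothetical row_counts.csv file that stores one count per day for a dataset:

import pandas as pd

df = pd.read_csv("today.csv")  # hypothetical: today's freshly pulled dataset

# Hypothetical history file with one row per day: date, row_count
history = pd.read_csv("row_counts.csv")

mean = history["row_count"].mean()
std = history["row_count"].std()

# Flag the load if today's count falls outside one standard deviation
assert abs(len(df) - mean) <= std, f"{len(df)} outside {mean:.0f} +/- {std:.0f}"

# Record today's count so future runs can use it
today = pd.DataFrame({"date": [pd.Timestamp.today().date()], "row_count": [len(df)]})
pd.concat([history, today]).to_csv("row_counts.csv", index=False)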

Right now, for tables that change daily, I'm doing the following rudimentary check:

# Allow today's row count to drift up to 100 rows from a hardcoded baseline
assert df.shape[0] <= 1016545 + 100
assert df.shape[0] >= 1016545 - 100

But obviously this is not sustainable.

Any suggestions are much appreciated.


Solution

  • Yes, you would need to store some previous information, but since you don't seem to need perfect statistical accuracy, I think you can cheat a little. If you keep the average record count from previous samples, the previously calculated deviation, and the number of samples taken, you can get reasonably close to what you are looking for by taking a weighted average of the previous deviation and the current deviation.

    For example:

    If the average count has been 1016545 with a deviation of 85, captured over 10 samples, and today's count is 1016612, then the difference from the mean is 1016612 - 1016545 = 67, and the weighted average of the previous deviation and the current deviation is (85*10 + 67)/11 ≈ 83.

    This way you only store a handful of variables per dataset instead of every record count back in time, though it also means the result is not a true standard deviation.
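    For instance, a rough sketch of that update (the stored values are the numbers from the example above, and the alert threshold of three "deviations" is just an illustrative choice):

    count = len(df)  # df is today's freshly loaded dataset

    # Previously stored values for this dataset
    mean, deviation, n = 1016545, 85, 10

    diff = abs(count - mean)

    # Alert if today's count strays too far from the running mean;
    # the multiplier of 3 here is arbitrary, tune it to your data
    assert diff <= 3 * deviation, f"count {count} is {diff} away from mean {mean}"

    # Fold today's sample into the stored values for tomorrow's run
    deviation = (deviation * n + diff) / (n + 1)
    mean = (mean * n + count) / (n + 1)
    n += 1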

    As for storage, you could keep this data in a database, a JSON file, or any number of other places -- I won't go into detail there, since it's not clear what environment you are working in or what resources you have available.
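    For example, with a plain JSON file (the filename and key names here are just illustrative):

    import json

    STATS_FILE = "dataset_stats.json"  # hypothetical location

    # Load the stored stats, e.g. {"my_dataset": {"mean": 1016545, "deviation": 85, "n": 10}}
    with open(STATS_FILE) as f:
        stats = json.load(f)

    s = stats["my_dataset"]
    # ... run the check and update s["mean"], s["deviation"], s["n"] as above ...

    # Write the updated values back for tomorrow's run
    with open(STATS_FILE, "w") as f:
        json.dump(stats, f)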

    Hope that helps!