Search code examples
mysqldatabaserelational-databasedata-integrity

Risks of Non-Standardized Databases?


I've been given a task at work to go through 110K records of a personnel database and produce a subset using specific departmental exclusions.

I've been instructed that the risk tolerance on my work is 0. As in, I can not have an excluded employee in the resulting subset.

The problem is that the database I am using has non-standardized fields for the fields I am supposed to exercise my exclusion logic against!

For example, if I were trying to exclude a specific Department, say "Account Collections", I can't simply query the database for 'department <> "Account Collections"' as the department can also be set in 6 other fields, none of which are standardized, they are all free-form fields.

To complicate matters, "Account Collections" is sometimes spelled incorrectly, or abbreviated by the user such as "Acc Col", or "Acct Coll."...

I'm trying to explain to my colleagues that this is an impossible task with zero risk tolerance as I have no idea if I will catch all "Account Collections" departments across all the fields.

I need help getting this concept across to non-tech related people, what the heck should I do?!

Thank you for your experience!


Solution

  • In order to answer your question about getting the point across to non-tech related people. Its difficult, You basically need to put it in writing as simply as you can that the current configuration of data coming in to the database is in a inconsistent state.

    The end result is that you can attempt to create the subset of data but cannot guarantee that it will not contain errors. Explain that they can either accept this risk or create a secondary body of work to manually check all rows of the output data you generate.