Best way to test SQL queries

I have run into a problem where complex SQL queries keep going out with errors. This results in sending mail to the wrong customers and other 'problems' like that.

What is everyone's experience with creating SQL queries like that? We are creating new cohorts of data every other week.

So here are some of my thoughts and the limitations to them:

Creating test data: Whilst this would prove that we have all the correct data, it does not enforce the exclusion of anomalies in production. That is, data that would be considered wrong today but may have been correct 10 years ago; it wasn't documented and therefore we only know about it after the data is extracted.
Create Venn diagrams and data maps: This seems to be a solid way to test the design of a query, however it doesn't guarantee that the implementation is correct. It gets the developers planning ahead and thinking of what is happening as they write.

Thanks for any input you can give to my problem.

Solution

You wouldn't write an application with functions 200 lines long. You'd decompose those long functions into smaller functions, each with a single clearly defined responsibility.

Why write your SQL like that?

Decompose your queries, just like you decompose your functions. This makes them shorter, simpler, easier to comprehend, easier to test, easier to refactor. And it allows you to add "shims" between them, and "wrappers" around them, just as you do in procedural code.

How do you do this? By making each significant thing a query does into a view. Then you compose more complex queries out of these simpler views, just as you compose more complex functions out of more primitive functions.

And the great thing is, for most compositions of views, you'll get exactly the same performance out of your RDBMS. (For some you won't; so what? Premature optimization is the root of all evil. Code correctly first, then optimize if you need to.)
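As a quick illustration of composing views, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for your RDBMS, and the customer table and both view names are hypothetical, not taken from the question. Each view does one thing, and the second is built on the first, so each layer can be checked on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical base table, invented for this sketch
    create table customer(
        id int primary key, email text, opted_out int, last_order_year int);

    -- view 1: a single responsibility, drop customers who opted out of mail
    create view mailable_customer as
        select * from customer where opted_out = 0;

    -- view 2: composed from view 1, keep only customers with a recent order
    create view recent_mailable_customer as
        select * from mailable_customer where last_order_year >= 2023;

    insert into customer values
        (1, 'a@example.com', 0, 2024),
        (2, 'b@example.com', 1, 2024),
        (3, 'c@example.com', 0, 2015);
""")

mailable = [r[0] for r in conn.execute("select id from mailable_customer order by id")]
cohort = [r[0] for r in conn.execute("select id from recent_mailable_customer order by id")]
print(mailable)  # [1, 3]
print(cohort)    # [1]
```

If the cohort is wrong, you can query mailable_customer directly to see which layer introduced the error.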

Here's an example of using several views to decompose a complicated query.

In the example, because each view adds only one transformation, each can be independently tested to find errors, and the tests are simple.

Here's the base table in the example:

create table month_value( 
    eid int not null, month int, year int,  value int );

This table is flawed, because it uses two columns, month and year, to represent one datum, an absolute month. Here's our specification for the new, calculated column:

We'll do that as a linear transform, such that it sorts the same as (year, month), and such that for any (year, month) tuple there is one and only one value, and all values are consecutive:

create view cm_absolute_month as 
select *, year * 12 + month as absolute_month from month_value;
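To see the transform at work across a year boundary, here is a small sketch. It assumes Python's built-in sqlite3 module in place of your RDBMS, and the four sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table month_value(eid int not null, month int, year int, value int);

    create view cm_absolute_month as
        select *, year * 12 + month as absolute_month from month_value;

    -- made-up sample rows, column order is (eid, month, year, value)
    insert into month_value values
        (1, 11, 2023, 10), (1, 12, 2023, 20), (1, 1, 2024, 30), (1, 2, 2024, 40);
""")

rows = conn.execute(
    "select year, month, absolute_month from cm_absolute_month order by year, month"
).fetchall()
absolute = [r[2] for r in rows]
print(absolute)  # [24287, 24288, 24289, 24290]
```

December 2023 maps to 2023 * 12 + 12 = 24288 and January 2024 to 2024 * 12 + 1 = 24289, so the values stay consecutive across the year change.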

Now what we have to test is inherent in our spec, namely that for any tuple (year, month), there is one and only one (absolute_month), and that (absolute_month)s are consecutive. Let's write some tests.

Our test will be a SQL select query, with the following structure: a test name and a case statement catenated together. The test name is just an arbitrary string. The case statement is just case when test statements then 'passed' else 'failed' end.

The test statements will just be SQL selects (subqueries) that must be true for the test to pass.

Here's our first test:

-- a select statement that catenates the test name and the case statement
select concat(
  -- the test name
  'For every (year, month) there is one and only one (absolute_month): ',
  -- the case statement
  case when
    -- one or more subqueries; in this case an expected value and an
    -- actual value that must be equal for the test to pass
    -- expected value
    ( select count(distinct year, month) from month_value )
    -- actual value
    = ( select count(distinct absolute_month) from cm_absolute_month )
  -- the then and else branches of the case statement
  then 'passed' else 'failed' end
-- close the concat function and terminate the query
);

Running that query produces this result: For every (year, month) there is one and only one (absolute_month): passed

As long as there is sufficient test data in month_value, this test works.
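To show why the test data matters, here is a hedged sketch that runs this test against a correct view and a deliberately broken one. It assumes Python's built-in sqlite3 module as the database, so the query is adapted: `||` replaces concat, and a `select distinct` subquery replaces count(distinct year, month), which SQLite does not accept. The helper run_uniqueness_test and the sample rows are hypothetical.

```python
import sqlite3

# The uniqueness test, adapted to SQLite syntax (an assumption of this sketch).
UNIQUENESS_TEST = """
select 'For every (year, month) there is one and only one (absolute_month): ' ||
  case when
      (select count(*) from (select distinct year, month from month_value))
    = (select count(distinct absolute_month) from cm_absolute_month)
  then 'passed' else 'failed' end
"""

def run_uniqueness_test(transform):
    """Build a throwaway database whose view uses `transform`, then run the test."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(f"""
        create table month_value(eid int not null, month int, year int, value int);
        create view cm_absolute_month as
            select *, {transform} as absolute_month from month_value;
        -- column order is (eid, month, year, value);
        -- 12/2023 and 11/2024 collide under year + month
        insert into month_value values
            (1, 12, 2023, 10), (1, 1, 2024, 20), (1, 11, 2024, 30);
    """)
    result = conn.execute(UNIQUENESS_TEST).fetchone()[0]
    conn.close()
    return result

good = run_uniqueness_test("year * 12 + month")
bad = run_uniqueness_test("year + month")
print(good)  # ends with 'passed'
print(bad)   # ends with 'failed'
```

The broken transform only fails because the sample rows include a colliding pair of months; with a single year of data it would pass, which is the reason to check the test data as well.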

We can add a test for sufficient test data, too:

select concat( 'Sufficient and sufficiently varied month_value test data: ',
   case when 
      ( select count(distinct year, month) from month_value) > 10
  and ( select count(distinct year) from month_value) > 3
  and ... more tests 
  then 'passed' else 'failed' end );

Now let's test that the values are consecutive:

select concat( '(absolute_month)s are consecutive: ',
  case when (
    select count(*)
    from cm_absolute_month a join cm_absolute_month b
      on (    (a.month + 1 = b.month and a.year = b.year)
           or (a.month = 12 and b.month = 1 and a.year + 1 = b.year) )
    where a.absolute_month + 1 <> b.absolute_month ) = 0
  then 'passed' else 'failed' end );
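Here is a sketch of this test catching a transform that is unique but not consecutive. As an assumption, it uses Python's built-in sqlite3 module as the database, with `||` in place of concat. The helper run_consecutive_test, the sample rows, and the broken transform year * 100 + month are hypothetical.

```python
import sqlite3

# The consecutiveness test, adapted to SQLite syntax (an assumption of this sketch).
CONSECUTIVE_TEST = """
select '(absolute_month)s are consecutive: ' ||
  case when (
    select count(*)
    from cm_absolute_month a join cm_absolute_month b
      on (    (a.month + 1 = b.month and a.year = b.year)
           or (a.month = 12 and b.month = 1 and a.year + 1 = b.year) )
    where a.absolute_month + 1 <> b.absolute_month ) = 0
  then 'passed' else 'failed' end
"""

def run_consecutive_test(transform):
    """Build a throwaway database whose view uses `transform`, then run the test."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(f"""
        create table month_value(eid int not null, month int, year int, value int);
        create view cm_absolute_month as
            select *, {transform} as absolute_month from month_value;
        -- column order is (eid, month, year, value); rows span a year boundary
        insert into month_value values
            (1, 11, 2023, 10), (1, 12, 2023, 20), (1, 1, 2024, 30);
    """)
    result = conn.execute(CONSECUTIVE_TEST).fetchone()[0]
    conn.close()
    return result

good = run_consecutive_test("year * 12 + month")
bad = run_consecutive_test("year * 100 + month")
print(good)  # ends with 'passed'
print(bad)   # ends with 'failed': 202312 + 1 is not 202401
```

Note that year * 100 + month would pass the one-and-only-one test, so the two tests cover different parts of the spec.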

Now let's put our tests, which are just queries, into a file, and run that script against the database. Indeed, if we store our view definitions in a script (or scripts; I recommend one file per group of related views) to be run against the database, we can add the tests for each view to the same script, so that (re-)creating a view also runs that view's tests. That way, we get regression tests whenever we re-create views, and, when the view creation runs against production, the view is also tested in production.
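One way to wire this together is a small runner that re-creates the view and immediately runs its tests. This is only a sketch under assumptions: it uses Python's built-in sqlite3 module as the database, keeps the view definition and tests as strings rather than in a file, and the names VIEW_DDL, TESTS and recreate_and_test are hypothetical.

```python
import sqlite3

# View definition and its tests kept side by side, as they would be in one script.
VIEW_DDL = """
drop view if exists cm_absolute_month;
create view cm_absolute_month as
    select *, year * 12 + month as absolute_month from month_value;
"""

TESTS = [
    """
    select 'For every (year, month) there is one and only one (absolute_month): ' ||
      case when
          (select count(*) from (select distinct year, month from month_value))
        = (select count(distinct absolute_month) from cm_absolute_month)
      then 'passed' else 'failed' end
    """,
    """
    select '(absolute_month)s are consecutive: ' ||
      case when (
        select count(*)
        from cm_absolute_month a join cm_absolute_month b
          on (    (a.month + 1 = b.month and a.year = b.year)
               or (a.month = 12 and b.month = 1 and a.year + 1 = b.year) )
        where a.absolute_month + 1 <> b.absolute_month ) = 0
      then 'passed' else 'failed' end
    """,
]

def recreate_and_test(conn):
    """Re-create the view, run every test query, and collect any failures."""
    conn.executescript(VIEW_DDL)
    results = [conn.execute(test).fetchone()[0] for test in TESTS]
    failures = [r for r in results if r.endswith("failed")]
    return results, failures

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table month_value(eid int not null, month int, year int, value int);
    -- made-up rows, column order is (eid, month, year, value)
    insert into month_value values
        (1, 11, 2023, 10), (1, 12, 2023, 20), (1, 1, 2024, 30);
""")

results, failures = recreate_and_test(conn)
for line in results:
    print(line)
print("failures:", len(failures))  # failures: 0
```

A deployment step could refuse to continue when the list of failures is not empty, which gives the regression check on every re-creation of the view.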