I maintain a Python program that provides advice on certain topics. It does this by applying a complicated algorithm to the input data.
The program code is regularly changed, both to resolve newly found bugs, and to modify the underlying algorithm.
I want to use regression tests. The trouble is that there's no way to tell what the "correct" output is for a given input, other than by running the program (and even then, only if it has no bugs).
Below I describe my current testing process. My question is whether there are tools to help automate this process (and, of course, any other feedback on what I'm doing is welcome).
The first time the program seemed to run correctly for all my input cases, I saved their outputs in a folder I designated for "validated" outputs. "Validated" means that the output is, to the best of my knowledge, correct for a given version of my program.
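To make this concrete, here is a minimal sketch of how that first capture might look, assuming a hypothetical `run_program()` entry point in a `myprogram` module and hypothetical folder names (`inputs/` and `validated/`) standing in for my real layout:

```python
from pathlib import Path

# Hypothetical entry point of the program; replace with the real one.
from myprogram import run_program  # assumption: run_program(input_text) -> output_text

INPUT_DIR = Path("inputs")         # assumption: one file per input case
VALIDATED_DIR = Path("validated")  # the folder designated for validated outputs


def capture_validated_outputs(version: int) -> None:
    """Run the program on every input case and store the outputs as the
    validated ("golden") results for the given internal version."""
    VALIDATED_DIR.mkdir(exist_ok=True)
    for input_file in sorted(INPUT_DIR.glob("*.txt")):
        output = run_program(input_file.read_text())
        # e.g. case01.out.v1 for internal version 1
        target = VALIDATED_DIR / f"{input_file.stem}.out.v{version}"
        target.write_text(output)


if __name__ == "__main__":
    capture_validated_outputs(version=1)
```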
If I find a bug, I make whatever changes I think would fix it. I then rerun the program on all the input sets and manually compare the outputs (a sketch of this rerun-and-compare step appears after the case discussion below). Whenever an output changes, I do my best to informally review those changes and figure out whether:

1. the changes are solely due to the intended bug fix, or
2. the changes reveal a new bug that I introduced.
In case 1, I increment the internal version counter. I mark the output file with a suffix equal to the version counter and move it to the "validated" folder. I then commit the changes to the Mercurial repository.
If, in the future, I decide to branch off this version after it is no longer current, I'll need these validated outputs as the "correct" ones for that particular version.
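The promotion step itself could look roughly like the following sketch, assuming a hypothetical `outputs/` folder that holds the latest run and the same `validated/` naming as in the earlier sketch; the `hg add` and `hg commit` calls are the standard Mercurial commands:

```python
import shutil
import subprocess
from pathlib import Path

OUTPUT_DIR = Path("outputs")       # assumption: outputs of the latest run
VALIDATED_DIR = Path("validated")


def promote_outputs(version: int) -> None:
    """Mark the latest outputs as validated for the given internal version
    and record the change in the Mercurial repository."""
    VALIDATED_DIR.mkdir(exist_ok=True)
    for output_file in sorted(OUTPUT_DIR.glob("*.out")):
        # e.g. case01.out -> validated/case01.out.v2
        target = VALIDATED_DIR / f"{output_file.name}.v{version}"
        shutil.copy2(output_file, target)
    # Stage the new validated files and commit; assumes there is something to commit.
    subprocess.run(["hg", "add", str(VALIDATED_DIR)], check=True)
    subprocess.run(
        ["hg", "commit", "-m", f"Validated outputs for version {version}"],
        check=True,
    )


if __name__ == "__main__":
    promote_outputs(version=2)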
In case 2, I of course try to find and fix the newly introduced bug. This process continues until I believe that the only changes relative to the previous validated version are due to the intended bug fixes.
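Here is the rerun-and-compare sketch referred to above, under the same assumptions (`inputs/`, `validated/`, and the hypothetical `run_program()`); it prints a unified diff for each case whose output differs from its latest validated file, so the manual review can focus only on the cases that actually changed:

```python
import difflib
from pathlib import Path

from myprogram import run_program  # assumption, as in the earlier sketch

INPUT_DIR = Path("inputs")
VALIDATED_DIR = Path("validated")


def latest_validated(stem: str) -> Path | None:
    """Return the highest-versioned validated output for an input case, if any."""
    candidates = sorted(
        VALIDATED_DIR.glob(f"{stem}.out.v*"),
        key=lambda p: int(p.suffix.lstrip(".v")),
    )
    return candidates[-1] if candidates else None


def report_changes() -> None:
    """Rerun every input case and diff it against the latest validated output."""
    for input_file in sorted(INPUT_DIR.glob("*.txt")):
        new_output = run_program(input_file.read_text())
        validated = latest_validated(input_file.stem)
        if validated is None:
            print(f"{input_file.name}: no validated output yet")
            continue
        diff = list(difflib.unified_diff(
            validated.read_text().splitlines(),
            new_output.splitlines(),
            fromfile=str(validated),
            tofile=f"{input_file.stem} (current run)",
            lineterm="",
        ))
        if diff:
            print("\n".join(diff))
        else:
            print(f"{input_file.name}: unchanged")


if __name__ == "__main__":
    report_changes()
```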
When I modify the code to change the algorithm, I follow a similar process.
Here's the approach I'll probably use.