Tags: python, unit-testing, testing, python-unittest, static-code-analysis

Detecting incorrect assertion methods


During one of the recent code reviews, I stumbled upon a problem that was not immediately easy to spot: assertTrue() had been used instead of assertEqual(), which basically resulted in a test that tested nothing. Here is a simplified example:

from unittest import TestCase


class MyTestCase(TestCase):
    def test_two_things_equal(self):
        self.assertTrue("a", "b")

The problem here is that the test passes, and technically the code is valid, since assertTrue() has an optional msg argument (which receives the "b" value in this case).
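
For comparison, the assertion that was actually intended does fail, since assertEqual() compares its two arguments:

from unittest import TestCase


class MyTestCase(TestCase):
    def test_two_things_equal(self):
        # unlike assertTrue("a", "b"), this fails with
        # AssertionError: 'a' != 'b'
        self.assertEqual("a", "b")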

Can we do better than relying on the person reviewing the code to spot this kind of problem? Is there a way to auto-detect it using static code analysis with flake8 or pylint?
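
To illustrate the kind of check I have in mind, here is a rough sketch using only the standard ast module (a real solution would presumably be packaged as a flake8 or pylint plugin): it flags assertTrue()/assertFalse() calls that receive more than one positional argument, since the extra argument silently becomes msg.

import ast
import sys

SUSPECT = {"assertTrue", "assertFalse"}


def check(filename):
    with open(filename) as handle:
        tree = ast.parse(handle.read(), filename=filename)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in SUSPECT
                and len(node.args) > 1):
            print(f"{filename}:{node.lineno}: {node.func.attr}() called with "
                  f"{len(node.args)} positional arguments - did you mean "
                  f"assertEqual()?")


if __name__ == "__main__":
    for name in sys.argv[1:]:
        check(name)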


Solution

  • Several years ago I came up with a general approach/methodology for assuring the quality of tests. The specification of a test can be reduced to two clauses:

    1. It must pass for correct implementation of the feature being tested, and
    2. It must fail for incorrect/broken implementation of the feature being tested

    To the best of my knowledge, while requirement 1 is routinely exercised, little attention is paid to requirement 2.

    Typically

    • a test-suite is created,
    • the code is run against it,
    • any failures (because of bugs either in the code or in the tests) are fixed
    • and we arrive at a situation where we believe that our code and tests are good.

    The actual situation may be that (some of) the tests contain bugs that (would) prevent them from catching bugs in the code. Therefore, seeing the tests pass shouldn't provide much reassurance to anyone who cares about the quality of the system until they are confident that the tests are indeed able to detect the problems they were designed against [1]. And a simple way to gain that confidence is to actually introduce such problems and check that they don't go unnoticed by the tests!
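
    Here is a minimal sketch of that idea (add() and broken_add() below are made-up stand-ins, not from any real code base): run the same test once against the real implementation and once against a deliberately broken one, and require that it passes the first time and fails the second.

    import io
    import unittest
    from unittest import mock


    def add(a, b):           # stand-in for the real implementation under test
        return a + b


    def broken_add(a, b):    # a deliberately injected bug
        return a - b


    class AddTests(unittest.TestCase):
        def test_add(self):
            self.assertEqual(add(2, 3), 5)


    def suite_passes():
        suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddTests)
        result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
        return result.wasSuccessful()


    if __name__ == "__main__":
        # Requirement 1: the test passes for the correct implementation.
        assert suite_passes()
        # Requirement 2: the same test fails once the bug is injected.
        with mock.patch(__name__ + ".add", broken_add):
            assert not suite_passes()
        print("the test is able to catch the injected bug")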

    In TDD (test-driven development), this idea is followed only partially: the recommendation is to add the test before the code, see it fail (it should, since there is no code yet) and then make it pass by writing the code. But the fact that a test fails when the code is missing doesn't automatically mean that it will also fail when the code is buggy (which seems to be exactly your case)!

    So the quality of a test suite can be measured as the percentage of bugs it would be capable of detecting. Any reasonable [2] bug that escapes the test suite suggests a new test case covering that scenario (or, if the test suite should have caught that bug, reveals a bug in the test suite). This also means that every test of the suite must be able to catch at least one bug (otherwise, that test is completely pointless).

    I was thinking about implementing a software system that facilitates adopting this methodology (i.e. one that allows injecting and maintaining artificial bugs in the code base and checks how the tests respond to them). This question acted as a trigger, so I am going to start working on it right away, hoping to put something together within a week. Stay tuned!

    EDIT

    A prototype version of the tool is now available at https://bitbucket.org/leon_manukyan/trit. I recommend cloning the repository and running the demo flow.


    [1] A more general version of this statement holds for a wider range of systems/situations (all typically having to do with security/safety):

    A system designed against certain events must be routinely tested against such events; otherwise it is prone to degrading to the point of being completely unable to react to the events of interest.

    Just an example - do you have a fire alarm system at home? When did you last see it work? What if it stays silent during a fire too? Go make some smoke in the room right now!

    [2] Within the scope of this methodology, a back-door-like bug (e.g. when the feature misbehaves only if the passed-in URL is equal to https://www.formatmyharddrive.com/?confirm=yesofcourse) is not a reasonable one.