Tags: java, automated-tests, integration-testing, test-data, data-driven

Expensive maintenance with automated test data


At my company, we have a growing set of integration tests using JUnit in a Java Web application. Each test uses specific external XML files to populate the database with the data it needs. The problems are:

  1. When there is a change in the model, it takes us a long time to correct all the XML files (we have hundreds of them, many with redundant data).
  2. The complexity of creating an XML file by hand discourages programmers from exploring different scenarios.
  3. There is no link between the test data and the test (e.g., inside the test I don't know the name of the User inserted by the XML). We could hard-code the information we need, but that would also increase maintenance time, since the XML and the hard-coded data would have to be kept in sync.

Facing these problems, I started thinking about using the system's own CRUD layer to generate the test data for each test. At the beginning of each test I would call some methods to persist the desired data. In my view, this would solve all three problems, since:

  1. Changes to the model require changing the CRUD code anyway, so correcting the test data would take no extra time.
  2. It would be easier to build test data, because we would not have to worry about things like matching IDs and foreign keys by hand.
  3. I would have all the important data in variables, with synchronization guaranteed by the compiler and the IDE.

However, I lack the experience and knowledge to commit to this approach; a rough sketch of what I have in mind follows below. My questions are: Is this solution effective? Does this approach cause other problems? Where can I find this approach in the literature? Is there a better solution to the listed problems?
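
To make the idea concrete, here is a minimal sketch of such a test. UserService, User, Report, and ReportService are hypothetical stand-ins for our real classes, and the example assumes JUnit 4:

    import org.junit.Before;
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class UserReportIT {

        private UserService userService; // the application's own CRUD layer (hypothetical)
        private User john;

        @Before
        public void persistTestData() {
            userService = new UserService();
            // Instead of loading an external XML file, persist the data
            // through the same CRUD code the application itself uses.
            john = userService.create(new User("John", "john@example.com"));
        }

        @Test
        public void reportListsThePersistedUser() {
            Report report = new ReportService().userReport();
            // The test references the inserted data directly (problem 3):
            assertEquals(john.getName(), report.firstUserName());
        }
    }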


Solution

  • The key to improving maintainability is to keep things DRY. Test data setup should not be redundant, and if your test technology offers no effective means of reuse, you are using the wrong technology.

    Writing Java code for the test data setup gives you familiar, powerful tools to improve code reuse across tests. It also offers better refactoring support than XML, and it makes the link between test data and test code explicit, because both live in the very same source file (or even the same method!). However, it does require tests to be written and maintained by programmers (not business analysts, managers, or testers who do not know Java). A small sketch of this kind of reuse follows at the end of this answer.

    Therefore, if test data is mostly authored and maintained by programmers, I'd do so in Java, through the CRUD layer (or even a full-fledged domain layer) of the real application. If, however, most test data originates from some data export, or is authored by people who are not programmers, a purely data-driven approach can be a better fit. It is also possible to combine these approaches (i.e. choose the most appropriate strategy for each entity).

    Personal experience: Our team used to do integration tests with DBUnit, but switched to setting up the test data as part of the test code, using our real data access layer. In so doing, our tests became more intention-revealing and easier to maintain. Test effort was reduced while test coverage improved, and more tests got written with less prodding. This was possible because the tests were written and maintained entirely by developers.
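
    To illustrate the kind of reuse Java makes easy, here is a minimal sketch of a test data builder; User, UserDao, and their methods are hypothetical placeholders for the real entities and data access layer:

        /** Builds and persists a User, hiding everything a test does not care about. */
        public class UserBuilder {

            // Sensible defaults, so each test states only the attributes it asserts on.
            private String name = "Default Name";
            private String email = "default@example.com";

            public UserBuilder named(String name) {
                this.name = name;
                return this;
            }

            public UserBuilder withEmail(String email) {
                this.email = email;
                return this;
            }

            /** Persists through the real data access layer, which assigns IDs and foreign keys. */
            public User persistWith(UserDao dao) {
                User user = new User(name, email);
                dao.save(user);
                return user;
            }
        }

    A test then reads User alice = new UserBuilder().named("Alice").persistWith(userDao); and when the model changes, the builder is corrected once rather than in hundreds of XML files.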