Tags: testing, vhdl, verilog, fpga

Testing FPGA Designs at Different Levels


Various aspects of test strategies for FPGAs have been discussed here on SO, but I can't find that the following question has been asked, discussed, or answered:

At what levels should you simulate your FPGA design and what do you verify at each level?

If you answer using concepts such as x-level testing, where x = block, subsystem, function, or something else, please describe what x is for you: something like its typical size, complexity, or an example.


Sep 14

Both answers are the same when it comes to the actual question, but I'll accept the answer from @kraigher since it's the shorter one.


Sep 10

This is a summary and comparison of the two answers from @Paebbles and @kraigher. One of the answers is very long, so hopefully this will help anyone who wants to contribute an answer of their own. Remember that there's a bounty at stake!

  • They both simulate all components at all levels; at least @Paebbles makes exceptions for components with very little functional content, for example a MUX.
  • They both strive for test automation
  • They both develop "tools" to simplify board-level testing
  • They both avoid testing things at one level that have already been tested at the level below
  • The biggest difference seems to be in how often the testbenches are simulated. @Paebbles tests directly in hardware unless there are major design changes, in which case the simulations are run as well. @kraigher runs the simulations more continuously as the design evolves. I think this is a very important question as well, and personally I prefer the approach @kraigher described. However, since this was not part of the original question, I still consider the two answers to be in consensus. How often tests should be run has also been discussed before on SO, for example in how often should the entire suite of a system's unit tests be run?

There is a difference in how much in-lab testing they do, but it seems mostly related to the specific circumstances of the projects (how many things cannot be effectively tested with simulations). I happen to know a bit about @kraigher's last project, so I can say that both projects are in the 1+ year category. It would be interesting to hear a story from someone with a smaller project. From what I've seen, far from all projects are this complete with respect to functional coverage in simulation, so there must be other stories.


Sep 7

These are a number of follow-up questions to @Paebbles, too long to fit among the comments.

Yes @Paebbles, you have provided much of what I was looking for, but I still have some extra questions. I'm afraid this may turn into a lengthy discussion, but given the amount of time we spend on verification and the various strategies people apply, I think it deserves a lot of attention. Hopefully we will get some more answers so that various approaches can be compared. Your bounty will surely help.

I think your story contains many good and interesting solutions, but I'm an engineer, so I will focus on the pieces I think can be challenged ;-)

You've spent a lot of time testing in hardware to address all the external issues you had. From a practical point of view (since they were not going to fix their SATA standard violations) it's like having a flawed requirement spec, such that you develop a design that solves the wrong problem. This is typically discovered when you “deliver”, which motivates delivering frequently and discovering the problems early, as you did. I'm curious about one thing though. When you discovered a bug in the lab that needed a design change, would you then update the testbenches at the lowest level where it could be tested? Not doing that increases the risk that the bug reappears in the lab, and over time it would also degrade the functional coverage of your testbenches, making you more dependent on lab testing.

You said that most testing was done in the lab, and that this was caused by the number of external problems you had to debug. Would your answer be the same if you only looked at your own internal code and bugs?

When you're working with long turnaround times, as you did, you find various ways to make use of that time. You described that you started to synthesize the next design while the first was being tested, and that if you found a bug in one drive you started to synthesize a fix for it while continuing to test other drives with the current design. You also described problems with observability when testing in the lab. I'm going to make a number of sceptical interpretations of this; you'll have to provide the positive ones!

If you could synthesize the next design immediately when you started to test the first, it seems like you were working with very small increments but still made the effort to run every test at every level all the way to hardware. This seems a bit overkill/expensive, especially when the hardware testing is not fully automated. Another sceptical interpretation is that you were looking for a bug but, due to poor observability, were producing random trial-and-error builds hoping that they would give clues to the problem you were trying to isolate. Was this really effective use of time, in the sense that every build added value, or was it more a case of “doing something is better than doing nothing”?

When designing the higher protocol layers, did you consider short-circuiting the communication stack at the higher levels to speed up the simulation? After all, the lower layers had already been tested.

You reused some components and assumed them to be bug-free. Was that because they were shipped with testbenches proving it? Proven-in-use arguments tend to be weak since reuse often happens in another context. The Ariane 5 rocket is a spectacular example; your reuse of XAPP 870 for Virtex 5 is another.

Since you have simulations at various levels, I would assume that you value the faster run times at the lower levels and the shorter feedback loop you get when you can verify a piece of your design before the larger structure has been completed. Still, you had pieces of code that were significant enough to be awarded their own components but too simple to be awarded their own testbenches. Can you give an example of such a component? Were they really bug-free? Personally, I don't write many lines of code before I make a mistake, so if I have a well-packaged piece of code like a component I take the opportunity to test at that level, for the reasons mentioned above.


Solution

  • I perform behavioral simulation at all levels. That is, every entity should have one corresponding test bench aiming for full functional coverage. If specific details of entities A, B and C have already been tested in isolation in their corresponding test benches, they do not have to be covered again in the test bench for entity D, which instantiates A, B and C and should focus on proving the integration.

  • I also have device- or board-level tests where the actual design is verified on the actual device or board. This is because you cannot trust a device-level simulation once models start to become inexact, and it also takes too long. On the real device, hours of testing can be achieved instead of milliseconds.

  • I try to avoid performing any post-synthesis simulation unless a failure occurs in the device-level tests, in which case I perform it to find bugs in the synthesis tool. In that case I can make a small wrapper of the post-synthesis netlist and re-use the test bench from the behavioral simulation (a sketch of such a wrapper follows after this list).

  • I work very hard to avoid any form of manual testing and instead rely on test automation frameworks for both simulation and device-level testing, such that testing can be performed continuously.

  • To automate simulation I use the VUnit test automation framework, of which @lasplund and I are the authors (a minimal testbench skeleton is sketched below).
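
To make the post-synthesis point a bit more concrete, below is a minimal sketch of what such a wrapper could look like. All names are made up for illustration: my_core is a hypothetical DUT with a couple of invented ports, and the synthesized netlist is assumed to have been compiled into a library called netlist_lib (the vendor's simulation primitive libraries also need to be visible to the simulator). The wrapper keeps the entity name and port list of the behavioral version, so the behavioral testbench can instantiate it unchanged.

    library ieee;
    use ieee.std_logic_1164.all;

    library netlist_lib;  -- assumed library holding the post-synthesis netlist

    -- Same entity name and ports as the behavioral version, so the existing
    -- testbench needs no changes; only the architecture differs.
    entity my_core is
      port (
        clk  : in  std_logic;
        rst  : in  std_logic;
        din  : in  std_logic_vector(7 downto 0);
        dout : out std_logic_vector(7 downto 0)
      );
    end entity;

    architecture post_synth of my_core is
    begin
      -- Delegate every port straight to the netlist top level.
      u_netlist : entity netlist_lib.my_core
        port map (
          clk  => clk,
          rst  => rst,
          din  => din,
          dout => dout
        );
    end architecture;

Whether the testbench elaborates the RTL architecture or this wrapper can then be selected with a configuration, or simply by which architecture was analyzed last, without touching the testbench itself.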
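
For the VUnit part, here is a minimal, self-contained testbench skeleton in the style VUnit expects; the entity name and test case names are made up, and the VUnit VHDL libraries are assumed to be compiled (normally handled by VUnit's Python run script, which discovers testbenches like this, fills in the runner_cfg generic and runs each test case as a separate simulation).

    library vunit_lib;
    context vunit_lib.vunit_context;

    entity tb_example is
      generic (runner_cfg : string);  -- filled in by the VUnit run script
    end entity;

    architecture tb of tb_example is
    begin
      main : process
      begin
        test_runner_setup(runner, runner_cfg);

        while test_suite loop
          if run("sanity_check") then
            -- A real testbench would drive and check a DUT here; this just
            -- shows the structure of a test case.
            check_equal(2 + 2, 4, "arithmetic still works");
          elsif run("another_test_case") then
            check(true, "placeholder for a second, independently run test");
          end if;
        end loop;

        test_runner_cleanup(runner);  -- reports pass/fail back to the run script
      end process;
    end architecture;

Each test case listed with run(...) shows up as its own test in the automation framework, which is what makes it practical to re-run the whole simulation suite continuously as the design evolves.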