Tags: code-analysis, dry, code-duplication

What duplication detection threshold do you use?


We all agree that duplication is evil and should be avoided (the Don't Repeat Yourself principle). To enforce that, static analysis tools such as Simian (multi-language) or Clone Detective (a Visual Studio add-in) can be used.

I just read Ayende's post about Kobe, where he says:

8.5% of Kobe is copy & pasted code. And that is with the sensitivity dialed high; if we set the threshold to 3, which is what I commonly do, it goes up to 12.5%.

I think that 3 is a very low threshold. In my company we offer code quality analysis as a service; our default duplication threshold is set to 20, and we still find a lot of duplication. I can't imagine setting it to 3: it would be impossible for our customers to even think about fixing everything it reported.
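For context, the threshold in tools like Simian is the minimum number of duplicated lines a block must contain before it is reported, so a threshold of 3 flags tiny snippets while 20 only reports large blocks. Here is a minimal, hypothetical Java sketch of the kind of small clone this difference hides or reveals (class and method names invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: buildCsv and buildTsv share a five-line
// duplicated span. A detector with threshold 3 (report any duplicate
// of 3+ lines) flags this pair; with threshold 20 it goes unreported.
public class ReportBuilder {

    public String buildCsv(List<String> rows) {
        List<String> cleaned = new ArrayList<>();
        for (String row : rows) {
            if (row == null || row.isEmpty()) continue;
            cleaned.add(row.trim());
        }
        return String.join(",", cleaned);
    }

    public String buildTsv(List<String> rows) {
        List<String> cleaned = new ArrayList<>();
        for (String row : rows) {
            if (row == null || row.isEmpty()) continue;
            cleaned.add(row.trim());
        }
        return String.join("\t", cleaned);
    }
}
```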

I understand Ayende's opinion about Kobe: it's an official sample, marketed as “intended to guide you with the planning, architecting, and implementing of Web 2.0 applications and services,” so the expectation of quality is high.

But for your own projects, what minimum threshold do you use for duplication?

Related question: How fanatically do you eliminate Code Duplication?


Solution

  • Three is a good rule of thumb, but it depends. Refactoring to eliminate duplication often means trading the conceptual simplicity of the codebase and its API for a smaller codebase that is more maintainable once someone understands it. I generally evaluate duplication in that light.

    At one extreme, if fixing the duplication makes the code more readable and adds little or nothing to its conceptual complexity, then any duplication is unacceptable. An example is when the duplicated code factors out neatly into a simple referentially transparent function that does something easy to explain and name (see the first sketch below).

    When a more complex, heavyweight solution such as metaprogramming or an OO design pattern is necessary, I may allow 4 or 5 instances, especially if the duplicated snippet is small. In those cases the conceptual complexity of the solution makes the cure worse than the disease until there are really a lot of instances (the second sketch below illustrates the indirection such a fix adds).

    In the most extreme case, where the codebase I'm working with is a rapidly evolving prototype and I don't yet know enough about the project's direction to draw abstraction lines that are both reasonably simple and reasonably future-proof, I just give up. In a codebase like this, I think it's better to focus on expediency and getting things done than on good design, even if the same piece of code is duplicated 20 times. The parts of the prototype that are creating all that duplication are often the ones that will be discarded relatively soon anyhow, and once you know which parts will be kept, you can always refactor them. Without the additional constraints created by the parts that get discarded, refactoring is often easier at that stage.
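To make the first point concrete, here is a minimal sketch, reusing the hypothetical ReportBuilder from the question, where the duplicated span factors out into a referentially transparent helper that is easy to name, so no duplication needs to remain:

```java
import java.util.ArrayList;
import java.util.List;

public class ReportBuilder {

    public String buildCsv(List<String> rows) {
        return String.join(",", cleanRows(rows));
    }

    public String buildTsv(List<String> rows) {
        return String.join("\t", cleanRows(rows));
    }

    // Referentially transparent: the result depends only on the input
    // and there are no side effects, so the helper is trivial to
    // explain, name, and test. This is the cheap kind of deduplication.
    private static List<String> cleanRows(List<String> rows) {
        List<String> cleaned = new ArrayList<>();
        for (String row : rows) {
            if (row == null || row.isEmpty()) continue;
            cleaned.add(row.trim());
        }
        return cleaned;
    }
}
```

And a hedged sketch of the heavier case from the second point: a Template Method base class (one common OO fix, chosen here purely for illustration) removes a small duplicated snippet, but every reader now has to follow an inheritance hop, which is why tolerating 4 or 5 instances can be the better trade:

```java
// The formerly duplicated validation lives once in the base class,
// but understanding either exporter now requires reading two classes.
abstract class Exporter {

    // Template method: shared skeleton with a subclass hook.
    public final String export(String payload) {
        if (payload == null || payload.isEmpty()) {
            throw new IllegalArgumentException("payload must not be empty");
        }
        return render(payload.trim());
    }

    protected abstract String render(String payload);
}

class JsonExporter extends Exporter {
    @Override
    protected String render(String payload) {
        return "{\"data\":\"" + payload + "\"}";
    }
}

class XmlExporter extends Exporter {
    @Override
    protected String render(String payload) {
        return "<data>" + payload + "</data>";
    }
}
```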