I'm searching for a tool that can compare source code for similarity.
Our current system is very simplistic: it produces a huge number of false positives, and the true positives easily get buried among them.
My requirements are:
To avoid confusion, the following two code snippets are equivalent and should be detected as such:
for (int i = 0; i < 10; i++) { bla; }
int i = 0; while (i < 10) { bla; i++; }
The same here:
int x = 10; y = x + 5;
int a = 10; y = a + 5;
I've used MOSS (http://theory.stanford.edu/~aiken/moss/) in the past to detect plagiarized code. Because it analyzes the structure of the code rather than its raw text, it should detect situations like the ones you presented above. The tool is language-aware, so comments are not considered in the analysis, and it goes a long way toward detecting code that has been modified through simple search-and-replace of variable and/or function names.
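To illustrate why comment stripping and identifier renaming stop mattering once the comparison is done on tokens rather than raw text, here is a minimal Python sketch. It is not how MOSS is implemented; the names `normalize`, `kgrams` and `similarity`, the tiny keyword set, and the choice of k = 3 are all assumptions made up for this example.

```python
import re

# Illustrative keyword set only (assumption); a real tool uses a full lexer per language.
KEYWORDS = {"int", "for", "while", "if", "else", "return"}

TOKEN_RE = re.compile(r"""
    //[^\n]*          # line comment
  | /\*.*?\*/         # block comment
  | [A-Za-z_]\w*      # identifier or keyword
  | \d+               # integer literal
  | \S                # any other single non-space character
""", re.VERBOSE | re.DOTALL)

IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def normalize(source):
    """Drop comments and replace every non-keyword identifier with 'ID'."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok.startswith("//") or tok.startswith("/*"):
            continue  # comments do not take part in the comparison
        if IDENT_RE.fullmatch(tok) and tok not in KEYWORDS:
            tokens.append("ID")  # variable and function names are erased
        else:
            tokens.append(tok)
    return tokens


def kgrams(tokens, k=3):
    """Set of all runs of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the two normalized k-gram sets, in [0, 1]."""
    ga, gb = kgrams(normalize(a), k), kgrams(normalize(b), k)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


renamed = similarity("int x = 10; y = x + 5;",
                     "int a = 10; /* renamed */ y = a + 5;")
loops = similarity("for (int i = 0; i < 10; i++) { bla; }",
                   "int i = 0; while (i < 10) { bla; i++; }")
print(renamed, loops)
```

With this sketch the renamed-variable pair scores 1.0, while the for/while pair only gets a partial score, because a purely token-based comparison does not know the two loops are equivalent. Catching that case would need a deeper normalization step (for example, rewriting loops into a canonical form before comparing).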
Note: I used the tool a few years ago when I taught computer science in grad school, and it worked wonderfully in detecting code that had been yanked from the internet. Here is a well-documented account of a similar application: http://fie2012.org/sites/fie2012.org/history/fie99/papers/1110.pdf
If you search Google for "measure software similarity", you should find a few more useful hits, such as: http://web.archive.org/web/20150219121637/http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/detectiontools_sourcecode.html