Search code examples
plsqlautomationduplicatesstatic-code-analysis

finding duplicate source code


I'm analyzing some legacy code. It is about 80.000 lines of old plsql code. On a fist look there is quite some duplication in the source which needs to be removed. Instead off doing diff's manual and looking at each file there must be some tool/commandline confu out there to detect duplicate lines of source code.

My goal is to make an educated guess about the minimal size of a rewrite of source and about how much actual knowledge is captured in this program. I wrote some a basic static code analyzer to find the amount of control statements IF ELSE FOR etc and Functions in each file. But duplicated code still needs to be removed from my statistics.


Solution

  • Have you looked at Simian - Similarity Analyser? (Just checked and it's no longer free, but it is available for a period of 15 days for evaluation purposes.)

    Simian (Similarity Analyser) identifies duplication in Java, C#, C, C++, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic, Groovy source code and even plain text files. In fact, simian can be used on any human readable files such as ini files, deployment descriptors, you name it.

    I have used it in practice and it does work well.