Detecting duplicate code across files and refactoring it semi-automatically

It doesn't matter whether the solution is a framework, a tool, or anything else. The problem is pretty hard to solve, and I have been fighting with it for years.

Here is an example to clarify what I mean.

File1

<head>
<title>Fotografia Elenco Completo Filtri Professionali</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<META name="Language" content="it">
<META http-equiv="Revisit-After" content="2 days">
<style>
<!--
 table.MsoNormalTable
    {mso-style-parent:"";
    font-size:10.0pt;
    font-family:"Times New Roman"}
-->
</style>
</head>

File2

<head>
<title>Militari</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="keywords" content="militari, ....">
<meta name="robots" content="INDEX, FOLLOW">
<meta name="Language" content="it">
<meta http-equiv="Revisit-After" content="2 days">
<meta name="Rating" content="General">
<link rel="stylesheet" type="text/css" href="./file/stile.css">
<script language="JavaScript">

File3

<head>
<title>Cinema - Recensioni e Trame di Film</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<meta name="keywords" content="recensioni film">
<meta name="description" content="Ottimo sito di recensioni di film, trame di film cinematografice, di Videogame e Romanzi. ">
<meta name="robots" content="INDEX, FOLLOW">
<meta name="Language" content="it">
<meta http-equiv="Revisit-After" content="2 days">
<meta name="Rating" content="General">
<link rel="stylesheet" type="text/css" href="file/stile.css">
<style type="text/css">
body {
    background-color:#F0F0F0;
    text-align: center;
}
</style>

For a human being, avoiding this kind of code duplication is an obvious task. A person can recognize that "", "" are delimiters, that the order of the lines doesn't matter, which parts can be put into variables (or stored as values in a database), and which files are similar enough to be refactored together.
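To make the "order doesn't matter" idea concrete, here is a rough Python sketch (not a real solution, just an illustration). It treats each header as an unordered set of normalized lines and separates what every file shares from what is specific to one file. The names `normalize` and `split_common` are made up for this example, and lowercasing whole lines is a crude assumption that happens to work for the snippets above.

```python
import re


def normalize(line: str) -> str:
    """Collapse whitespace and lowercase the line so that
    `<META name=...>` and `<meta name=...>` compare equal (crude assumption)."""
    return re.sub(r"\s+", " ", line.strip()).lower()


def split_common(files: dict[str, str]) -> tuple[set[str], dict[str, set[str]]]:
    """Return (lines present in every file, lines specific to each file).
    Line order is ignored because each file becomes a set of lines."""
    line_sets = {
        name: {normalize(l) for l in text.splitlines() if l.strip()}
        for name, text in files.items()
    }
    common = set.intersection(*line_sets.values())
    specific = {name: lines - common for name, lines in line_sets.items()}
    return common, specific


# Shortened versions of File1 and File2 from above.
files = {
    "file1": """<head>
<title>Fotografia Elenco Completo Filtri Professionali</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<META name="Language" content="it">
<META http-equiv="Revisit-After" content="2 days">
""",
    "file2": """<head>
<title>Militari</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="robots" content="INDEX, FOLLOW">
<meta name="Language" content="it">
<meta http-equiv="Revisit-After" content="2 days">
""",
}

common, specific = split_common(files)
print(sorted(common))    # candidates for the shared header
print(specific["file2"])  # candidates for per-file values
```

The shared lines are what a common header routine could print unconditionally; the per-file lines are what would have to become variables or database values.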

The whole process does not seem terribly hard to automate. But... I haven't found any solution so far. Even automating the recognition of the delimiters is hard.

The best approach I have found is to play with regular expression tools and go mad :D

After refactoring

file1

header -> PrintHeader();

file2

header -> PrintHeader();

file3

header -> PrintHeader();

GlobalFile

class header
{
    function PrintHeader()
    {
        SELECT title, content-type, language, revisit-after, rating, robots, extra_text_unparsed
        INTO myArray
        FROM header_table
        WHERE filename = $filename

        foreach (v in myArray)
        {
            echo ....
        }
    }
}
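For comparison, the same idea can be written as a small runnable Python sketch. It is only an illustration under several assumptions: an in-memory SQLite database stands in for the real one, the function name `render_header` is hypothetical, the column names use underscores (hyphens are not valid in unquoted SQL identifiers), and only a few of the columns from the pseudocode are included.

```python
import sqlite3
from html import escape


def render_header(db: sqlite3.Connection, filename: str) -> str:
    """Build the <head> block for one page from values stored in header_table."""
    row = db.execute(
        "SELECT title, language, revisit_after, robots "
        "FROM header_table WHERE filename = ?",
        (filename,),
    ).fetchone()
    if row is None:
        raise KeyError(filename)
    title, language, revisit_after, robots = row
    lines = [
        "<head>",
        f"<title>{escape(title)}</title>",
        f'<meta name="Language" content="{escape(language)}">',
        f'<meta http-equiv="Revisit-After" content="{escape(revisit_after)}">',
    ]
    if robots:  # optional value: File1 above has no robots tag
        lines.append(f'<meta name="robots" content="{escape(robots)}">')
    lines.append("</head>")
    return "\n".join(lines)


# Example data taken from File2 above; the table layout is an assumption.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE header_table ("
    "filename TEXT PRIMARY KEY, title TEXT, language TEXT, "
    "revisit_after TEXT, robots TEXT)"
)
db.execute(
    "INSERT INTO header_table VALUES (?, ?, ?, ?, ?)",
    ("file2", "Militari", "it", "2 days", "INDEX, FOLLOW"),
)
print(render_header(db, "file2"))
```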

Any suggestions?

Solution

What you want is a clone detector.

See https://en.wikipedia.org/wiki/Duplicate_code. There's a list of clone detectors there.

The key issues are:

What language does the clone detector support?
How does it detect clones?
How can such clones be removed?
Does the tool provide automation for removing clones?

Pure "string clone detection" can be language independent, but it typically cannot find removable clones because it doesn't understand the boundaries between code fragments.
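A toy illustration of the string-based approach, in Python. This is a minimal sketch, not how any particular tool works: it slides a fixed-size window of normalized lines over each file and reports windows that occur in more than one file. The function name `find_line_clones` and the window size are assumptions made for the example.

```python
from collections import defaultdict


def find_line_clones(files: dict[str, str], window: int = 3) -> dict:
    """Map each run of `window` consecutive normalized lines to the
    (file, line index) positions where it occurs in more than one file."""
    seen = defaultdict(list)
    for name, text in files.items():
        # Normalize: drop blank lines, collapse whitespace, ignore case.
        lines = [" ".join(l.split()).lower() for l in text.splitlines() if l.strip()]
        for i in range(len(lines) - window + 1):
            seen[tuple(lines[i:i + window])].append((name, i))
    return {k: v for k, v in seen.items() if len({n for n, _ in v}) > 1}


# Excerpts of File2 and File3 from the question.
files = {
    "file2": """<meta name="robots" content="INDEX, FOLLOW">
<meta name="Language" content="it">
<meta http-equiv="Revisit-After" content="2 days">
<meta name="Rating" content="General">
<link rel="stylesheet" type="text/css" href="./file/stile.css">
""",
    "file3": """<meta name="keywords" content="recensioni film">
<meta name="robots" content="INDEX, FOLLOW">
<meta name="Language" content="it">
<meta http-equiv="Revisit-After" content="2 days">
<meta name="Rating" content="General">
<link rel="stylesheet" type="text/css" href="file/stile.css">
""",
}

clones = find_line_clones(files)
for positions in clones.values():
    print(positions)
```

Note that the matching windows start and end wherever the text happens to agree. Nothing tells the detector that a window covers a complete element or statement, which is the boundary problem described above.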

I build AST-based clone detectors. These detect clones based on the structure of the target language, as represented by the AST. Clones detected this way are much more natural with respect to language boundaries than those found by other detectors. A downside: these detectors are necessarily language dependent, so you need a different one for each language. The payoff is that you get high-quality clones detected across large bodies of code.
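To give a feel for the idea (this is a deliberately tiny sketch and not the implementation of any real tool), the following uses Python's standard `ast` module to fingerprint statements by tree shape only, ignoring identifier names and literal values. The names `shape` and `find_ast_clones` and the `min_nodes` threshold are hypothetical. It only handles Python source, which also illustrates the language dependence.

```python
import ast
from collections import defaultdict


def shape(node: ast.AST) -> str:
    """Structural fingerprint: node type names only.
    Identifier names and literal values are ignored; operators are kept
    because they are distinct node types (Add, Mult, ...)."""
    children = ", ".join(shape(c) for c in ast.iter_child_nodes(node))
    return f"{type(node).__name__}({children})"


def find_ast_clones(sources: dict[str, str], min_nodes: int = 5) -> list:
    """Group statements with identical shape; return groups with 2+ members.
    Nested statements are reported too, so a cloned function also
    reports its cloned body."""
    buckets = defaultdict(list)
    for name, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            size = sum(1 for _ in ast.walk(node))
            if isinstance(node, ast.stmt) and size >= min_nodes:
                buckets[shape(node)].append((name, node.lineno))
    return [locs for locs in buckets.values() if len(locs) > 1]


# Two made-up files whose functions differ only in names and constants.
sources = {
    "a.py": "def area(w, h):\n    return w * h + 1\n",
    "b.py": "def volume(x, y):\n    return x * y + 2\n",
}

for group in find_ast_clones(sources):
    print(group)
```

Because every reported clone is a whole syntax node (here a statement or a function), each one starts and ends on a boundary the language itself recognizes, which is what makes it a candidate for abstraction.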

Removing clones automatically is hard; each language offers its own means for abstracting code (e.g., make a subroutine, macro, include file, ...), and the tool would have to know each of them. You invented an abstraction for HTML that lies outside what HTML can express (putting fragments into a database is not in HTML's vocabulary).

As a practical matter, there are basically no automated clone removers. Pretty much what you have to do is identify the clones (this is where a clone detector helps) and then remove them manually, especially to get custom effects like the one you show.

If you want to implement an automated clone removal tool, you need what amounts to a program transformation system. (See my bio for one that also happens to support clone detection.)