algorithm duplicates text-parsing code-duplication approximation

Duplicate Programs

I want to write a program that can measure the similarity between code files (ideally as a percentage, or at least by "guessing" which files could have been copied). I will run it on about 30 files with at most 500 lines each. The goal is to identify duplicate files, or the ones suspected of being duplicates.

I have run into several problems:

Spacing: one file can contain extra spaces or line breaks.
Comments: one file has comments and the other has none, or has different comments.

I thought I could solve these two problems by removing all spaces, line breaks, and comments from the code, but then I run into the following:

Files that try to "hide" the similarity. Consider the following two C files as an example:

Code 1:

void main()
{
    int x;
    int y;
    scanf("%d", &x);
    switch(x)
    {
        case 1:
        //some code
        break;

        case 2:
        //some code
        break;
    }
}

Code 2:

#define ONE 1
#define TWO 2
void main()
{
    int a, b;
    scanf("%d", &a);
    switch(a)
    {
        case ONE:
        //some code
        break;

        case TWO:
        //some code
        break;
    }
}

I would appreciate any help (either pointers to existing tools or a suggested algorithm).

Thanks.

Solution

You might be interested in looking at MOSS, a system developed at Stanford which attempts to solve exactly your problem.

If you're curious about developing your own approach, however, here are some ideas to address the issues you mentioned so far:

Parse the code into an AST, so that you can easily manipulate code as a data structure and ignore issues like whitespace.
You can detect changes in variable names by renaming the variables yourself, using some scheme which guarantees a unique naming based on order of declaration and scoping. For some inspiration, see De Bruijn indices.
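As a rough illustration of the two ideas above, here is a minimal Python sketch. It is not a real AST-based solution: it assumes a regex tokenizer is good enough for small C files, it renames identifiers by order of first appearance and ignores scoping (a much cruder scheme than De Bruijn indices), and it only expands simple one-token `#define NAME value` macros. The function names (`strip_comments`, `normalize`, `similarity`) are made up for this example.

```python
import re
from difflib import SequenceMatcher

C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if", "int",
    "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile",
    "while",
}

# Matches comments and string/char literals, so that comment markers
# inside a literal (e.g. "http://x") are not mistaken for comments.
COMMENT_OR_LITERAL = re.compile(
    r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'', re.DOTALL)

# Assumption: only object-like macros with a single-token value are handled.
DEFINE_RE = re.compile(
    r'^[ \t]*#[ \t]*define[ \t]+([A-Za-z_]\w*)[ \t]+(\S+)[ \t]*$', re.MULTILINE)

TOKEN_RE = re.compile(r'''
    "(?:\\.|[^"\\])*"      # string literal
  | '(?:\\.|[^'\\])*'      # character literal
  | [A-Za-z_]\w*           # identifier or keyword
  | \d+(?:\.\d+)?          # number
  | \S                     # any other single non-space character
''', re.VERBOSE)

IDENT_RE = re.compile(r'[A-Za-z_]\w*')


def strip_comments(src):
    """Replace comments with a space; leave string/char literals untouched."""
    def keep_literal(m):
        text = m.group(0)
        return text if text[0] in '"\'' else ' '
    return COMMENT_OR_LITERAL.sub(keep_literal, src)


def normalize(src):
    """Return a token list with whitespace, comments and names factored out."""
    src = strip_comments(src)
    macros = dict(DEFINE_RE.findall(src))  # e.g. {"ONE": "1"}
    src = DEFINE_RE.sub('', src)
    names = {}                             # original name -> canonical name
    tokens = []
    for tok in TOKEN_RE.findall(src):
        tok = macros.get(tok, tok)         # expand simple macros
        if IDENT_RE.fullmatch(tok) and tok not in C_KEYWORDS:
            # Canonical name depends only on order of first appearance.
            # Function names such as scanf are renamed too (a simplification).
            tok = names.setdefault(tok, f'id{len(names)}')
        tokens.append(tok)
    return tokens


def similarity(src_a, src_b):
    """Similarity of two sources in [0, 1], compared token by token."""
    matcher = SequenceMatcher(None, normalize(src_a), normalize(src_b),
                              autojunk=False)
    return matcher.ratio()
```

For 30 files you could call `similarity` on every pair (435 comparisons) and flag the pairs above some threshold; the threshold itself is something you would have to tune on your own data. On the two example files from the question this sketch should score them as highly similar but not identical, because `int x; int y;` and `int a, b;` still tokenize differently. A real AST would let you normalize that kind of difference as well.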