I want to write code that can find similarity between code files (maybe as a percentage, or at least "guess" which files could have been copied). I run it on 30 files with at most 500 lines each. I want to identify duplicate files (or the ones suspected of being duplicates).
I encounter several problems:
I thought I could solve these 2 problems by removing all spaces, line breaks, and comments from the code, but then I ran into the following case:
Code 1:
void main()
{
    int x;
    int y;
    scanf("%d", &x);
    switch(x)
    {
        case 1:
            //some code
            break;
        case 2:
            //some code
            break;
    }
}
Code 2:
#define ONE 1
#define TWO 2
void main()
{
    int a, b;
    scanf("%d", &a);
    switch(a)
    {
        case ONE:
            //some code
            break;
        case TWO:
            //some code
            break;
    }
}
I would appreciate any help (pointers to existing tools or a suggested algorithm).
Thanks.
You might be interested in looking at MOSS, a system developed at Stanford which attempts to solve exactly your problem.
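For a sense of what a crude baseline looks like before reaching for a tool, here is a minimal C sketch of the stripping step described in the question (drop whitespace and comments), followed by a simple score: the fraction of k-character windows of one file that also appear in the other. This is only an illustration and not how MOSS works internally. The function names `normalize` and `kgram_containment` and the choice of `k` are my own assumptions, and the comment stripping deliberately ignores comment markers inside string literals.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst without whitespace or comments.
   Returns the length of the result. dst needs at least strlen(src) + 1
   bytes. Simplification: comment markers inside string literals are
   treated as real comments. */
size_t normalize(const char *src, char *dst)
{
    size_t n = 0;
    while (*src) {
        if (src[0] == '/' && src[1] == '/') {            /* line comment */
            while (*src && *src != '\n') src++;
        } else if (src[0] == '/' && src[1] == '*') {     /* block comment */
            src += 2;
            while (*src && !(src[0] == '*' && src[1] == '/')) src++;
            if (*src) src += 2;                          /* skip closing marker */
        } else if (isspace((unsigned char)*src)) {
            src++;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}

/* Hypothetical helper: fraction (0.0 to 1.0) of the k-character windows
   of a that occur somewhere in b. Brute force, which is fine for a few
   dozen small files. */
double kgram_containment(const char *a, const char *b, size_t k)
{
    size_t la = strlen(a), lb = strlen(b);
    if (k == 0 || la < k || lb < k) return 0.0;

    size_t total = la - k + 1, hits = 0;
    for (size_t i = 0; i < total; i++) {
        for (size_t j = 0; j + k <= lb; j++) {
            if (memcmp(a + i, b + j, k) == 0) { hits++; break; }
        }
    }
    return (double)hits / (double)total;
}
```

With 30 files there are only 435 pairs, so calling `normalize` on each file once and `kgram_containment` on every pair is cheap. Note that this baseline only catches near-verbatim copies: renamed variables or constants hidden behind `#define`, as in the two samples from the question, will still lower the score.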
If you're curious about developing your own approach, however, here are some ideas to address the issues you mentioned so far: