I was looking for an algorithm to compare text contents, and I found a site called Copyscape that has a very handy tool for comparing articles (link). It seems to do a good job of detecting the similarity (as a percentage) between two text documents. Which algorithm does that tool use, or is there something similar to it? Thanks in advance.
I am not sure how Copyscape's plagiarism detection works, but if you asked me to implement one, this is how I would approach it.
I would start by defining 'plagiarism': content-1 and content-2 are nearly the same, say more than 80% identical, i.e. content-1 was taken and up to 20% of it was changed to produce content-2.
Now let us try to solve this: what is the cost (number of changes) of converting content-1 into content-2? This is a well-known dynamic programming (DP) problem, called the Levenshtein distance or edit distance problem. The standard formulation measures the distance between strings in characters, but you can easily modify it to work on words instead. Additionally, you may need to track each change by line number and word position in both contents.
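To make the word-level variant concrete, here is a minimal sketch in Python. The function name word_edit_distance is my own, and splitting on whitespace is an assumption; a real tool would probably also normalise case and punctuation first.

```python
def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance counted in words instead of characters."""
    src, dst = a.split(), b.split()  # assumption: words are whitespace-separated

    # prev[j] = edits needed to turn the words of src seen so far
    # (initially none) into the first j words of dst.
    prev = list(range(len(dst) + 1))
    for i, s_word in enumerate(src, start=1):
        curr = [i]  # turning i source words into nothing costs i deletions
        for j, d_word in enumerate(dst, start=1):
            cost = 0 if s_word == d_word else 1
            curr.append(min(
                prev[j] + 1,         # delete s_word
                curr[j - 1] + 1,     # insert d_word
                prev[j - 1] + cost,  # keep the word or substitute it
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(word_edit_distance("the quick brown fox", "the slow brown fox jumps"))
```

Only two rows of the DP table are kept here, which is enough for the distance alone. If you also want to report which words changed, you need the full table so you can walk back through it.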
The above gives you the least number of changes needed to convert content-1 into content-2. Dividing that by the total length of content-1 (in words) gives the percentage of changes needed to get from content-1 to content-2. If it is below a fixed threshold (say 20%), declare plagiarism. Also, with the auxiliary information on line number and word position in both contents, you can show the changes that were made.
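A rough sketch of that last step, again in Python. The names word_changes and looks_plagiarised, the tuple layout for a change, and the default threshold of 0.20 are all assumptions for illustration. It tracks word positions only; line numbers would need the text to be split by line first.

```python
def word_changes(a: str, b: str):
    """Return (distance, changes) for turning the words of a into the words of b."""
    src, dst = a.split(), b.split()
    n, m = len(src), len(dst)

    # dist[i][j] = edits needed to turn src[:i] into dst[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # keep or replace

    # Walk back from the bottom-right corner to recover one optimal edit script.
    changes = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and src[i - 1] == dst[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            if not same:
                changes.append(("replace", i - 1, src[i - 1], dst[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            changes.append(("delete", i - 1, src[i - 1], None))  # position in content-1
            i -= 1
        else:
            changes.append(("insert", j - 1, None, dst[j - 1]))  # position in content-2
            j -= 1
    changes.reverse()
    return dist[n][m], changes


def looks_plagiarised(a: str, b: str, threshold: float = 0.20) -> bool:
    """True if the share of changed words, relative to content-1, is below threshold."""
    distance, _ = word_changes(a, b)
    total = max(len(a.split()), 1)  # avoid dividing by zero for empty input
    return distance / total < threshold


if __name__ == "__main__":
    one = "the quick brown fox jumps over the lazy sleeping dog"
    two = "the quick brown fox leaps over the lazy sleeping dog"
    print(word_changes(one, two))
    print(looks_plagiarised(one, two))
```

The full table costs memory proportional to the product of the two word counts, so for long articles you may want to compare paragraph by paragraph rather than whole documents at once.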