algorithm project cosine-similarity sentence-similarity

What's the best algorithm to check the similarity percentage among the submitted assignments?

I am planning to build a project for final year that is similar to similarity checker. In the project, I am planning to check the similarity percentage among the submitted assignments i.e offline.

For example:

When the first student submits an assignment, it is not checked with any other assignments.
When the second student submits an assignment, it is checked with the first assignment.
When the third student submits an assignment, it is checked with first and second submitted assignments.
Similarly, if there are 35 students thEn the 36th submitted assignment is checked with the rest of the 35 submitted assignments.

Now, here comes the question that how to compare two assignments. In this case comparison is similarity between the texts in the documents. I want the result similar to this:

I just want to show the percentage of similar sentences and what they are?

What I did:

I studied different algorithms like td-idf, cosine similarity algorithm but I am unable to correctly interpolate the results of the algorithm.

So, I want to know which algorithm is the best in this situation and I want to know how this is done. Are there any references to any sites, blogs that would help me?

Solution

It depends on how the algorithm you use returns comparison results.

For example, the following function compares a list of document contents and returns a dictionary of document pairs mapped to the list of common word sequences between them. It does not discriminate between word sequences that are included in each other because there may be a different number of occurrences of longer and shorter word sequence that overlap.

import re
from itertools import combinations

def wordList(document): return re.findall("(\w+|\d+)",document.lower())

def compareDocs(documents, minSize=2, maxSize=25):
    result  = dict() # { (documentIndex,documentIndex) : [CommonExpressions] }
    def tallyDuplicates(expressionDocs):
        for expression,docIndexes in expressionDocs.items():
            for docIndex,otherDoc in combinations(docIndexes,2):
                result.setdefault((docIndex,otherDoc),[]).append(expression)

    documentWords    = [ wordList(document) for document in documents ]
    wordCounts       = [ len(words) for words in documentWords ]
    expressionRanges = dict()
    for docIndex,words in enumerate(documentWords):
        for wordIndex,word in enumerate(words):
            expressionRanges.setdefault(word,[]).append((docIndex,wordIndex))

    size = 1    
    while size == 1 or expressionDocs and size <= maxSize:        
        nextExpressions   = dict()
        expressionDocs    = dict()
        for expression,starts in expressionRanges.items():
            for docIndex,startIndex in starts:
                endIndex = startIndex+size
                if endIndex >= wordCounts[docIndex]: continue
                extended = " ".join([expression,documentWords[docIndex][endIndex]])
                expressionDocs.setdefault(extended,set()).add(docIndex)
                nextExpressions.setdefault(extended,[]).append( (docIndex,startIndex) )
        expressionDocs   = { expression:docIndexes for expression,docIndexes in expressionDocs.items() if len(docIndexes) > 1 }
        expressionRanges = { expression:ranges for expression,ranges in nextExpressions.items() if expression in expressionDocs }  
        if size >= minSize: tallyDuplicates(expressionDocs)
        size += 1

    return result

Based on these comparison results, you need to analyze the contents of each document pair to count words covered by the common expressions (word sequences). Given that an expression contains several words, each expression will account for multiple words in the similarity ratio : words-in matching-expressions / words-in-document.

[EDIT] I Placed the result analysis in its own function and added an html output to highlight the expressions in the document texts:

def analyzeComparison(doc1,doc2,commonExpr):
    words1  = wordList(doc1)
    words2  = wordList(doc2)
    normalizedDoc1 = " ".join(words1)
    normalizedDoc2 = " ".join(words2)
    expressions.sort(key=lambda s:len(s),reverse=True)
    matches = []
    for expression in expressions:
        count1 = len(re.findall(expression,normalizedDoc1))
        count2 = len(re.findall(expression,normalizedDoc2))
        commonCount = min(count1,count2)
        if commonCount == 0: continue
        expressionId = "<#"+str(len(matches))+"#>"
        normalizedDoc1 = re.sub(expression,expressionId,normalizedDoc1,commonCount)
        normalizedDoc2 = re.sub(expression,expressionId,normalizedDoc2,commonCount)
        matches.append((expression,commonCount))
    commonWords = sum( count*len(expr.split(" ")) for expr,count in matches)
    percent1 = 100*commonWords/len(words1)
    percent2 = 100*commonWords/len(words2)
    for index,match in enumerate(matches):
        expressionId = "<#"+str(index)+"#>"
        expressionHighlight = "<span style='background-color:yellow'>"+match[0]+"</span>"
        normalizedDoc1 = re.sub(expressionId,expressionHighlight,normalizedDoc1)
        normalizedDoc2 = re.sub(expressionId,expressionHighlight,normalizedDoc2)
    return (percent1,percent2,matches,normalizedDoc1,normalizedDoc2)

For example: If you have the following 3 documents (you would typically read them from files):

doc1 = """
Plagiarism, one of the main scourges of the academic life, is quite an easy concept, but, nonetheless, harmful. In short, to plagiarize means to steal someone else’s idea or part of work and use it as your own. But why exactly it is considered to be so bad and immoral? And it is really considered immoral and a serious offence. In case it is discovered, it may lead to very unpleasant consequences; the higher the position of the offender is, the more unpleasant they are.
copy and paste
There are two major kinds of harm plagiarism causes. First, it is something as simple as stealing and lying – you just steal someone else’s work and trick somebody into believing it was you who had written it, which is as immoral as any other kind of theft is. It means that somebody had actually spent time and effort in order to create something, while you did nothing but ripping it off and submitting it.
copy and paste function
Second, it is a crime you commit against yourself. If you study at an educational institution, there are certain tasks copy and paste you are given in order to ensure that you learn something. When you resort to plagiarism, you undo all these efforts for, instead of actually doing something and understanding it in process, you use someone else’s work and the certain amount of experience that you were supposed to get just misses you.
"""
doc2 = """
Plagiarism has always been a problem in schools. However, with the invention of the internet,copy and paste  it has made plagiarism even more of a challenge. Plagiarism.org, “estimates that nearly 30 percent of all students may be plagiarizing on all their written assignments and that the use of the Internet has made plagiarism much worse.” [1] The act of plagiarism can be defined as, “To steal and pass off (the ideas or words of another) as one’s own, to use (another’s production) without crediting the source, to commit literary theft, to present as new and original as idea or product derived from an existing source”2. Plagiarism has become such a concern for colleges that almost all the sites on this topic are sponsored by schools. The three main topics with plagiarism are the copy and paste function, “paper mills” and the ways that can be used to prevent students from doing this. 
it is quite an easy concept
The first major concern with the internet would be the copy and paste function. Wittenberg copy and paste function lists that “Widespread availability of the internet and increased access to full text databases has made cut and paste plagiarism very easy”.3 While the function is actually very nice to have, people are using it the wrong way. Instead of just using it to copy quotes from websites, than pasting it to their word document and giving it the proper credit, people are passing it off as their own. This is where the problem occurs.
"""

doc3 = """
Plagiarism has always been a problem in schools. However, it is something as simple as stealing and lying
it is a crime you. some other text
"""

You would first call the compareDocs() on the list of document contents and, for each pair of document (returned by the function), you would use analyzeComparison() to obtain percentages, counts and highlights :

documents   = [doc1,doc2,doc3]
comparisons = compareDocs( documents )
for documentPair,expressions in comparisons.items():
    docIndex1,docIndex2 = documentPair
    doc1 = documents[docIndex1]
    doc2 = documents[docIndex2]        
    pct1,pct2,matches,doc1,doc2 = analyzeComparison(doc1,doc2,expressions)

    # print result on console ...
    print(int(pct1//1)," % of document #",docIndex1," is same as document #", docIndex2)
    print(int(pct2//1)," % of document #",docIndex2," is same as document #", docIndex1)
    print("Common expressions are:")
    for expression,count in matches:
        print( "    ",expression,"(",count,"times )")
    print("")

    # output comparison result to an HTML file...
    htmlPage = "<html><body><table border='1'>"
    htmlPage += "<tr><th>#" + str(docIndex1) + ": Source " + str(int(pct1//1)) + "% duplicate</th>"
    htmlPage += "<th>#" + str(docIndex2) + ": Target  " + str(int(pct2//1)) + "% duplicate</th></tr>"
    htmlPage += "<tr><td width='50%' valign='top'>" + doc1 + "</td><td valign='top'>" + doc2 + "</td></tr>"
    htmlPage +="</table></body></html>"        
    fileName = str(docIndex1)+"-"+str(docIndex2)+".html"
    with open(fileName,"w") as f: f.write(htmlPage)

This prints the following information and creates a bunch of HTML files that look similar to your desired result:

3.0  % of document # 1  is same as document # 2
34.0  % of document # 2  is same as document # 1
Common expressions are:
     plagiarism has always been a problem in schools however ( 1 times )

6.0  % of document # 0  is same as document # 1
5.0  % of document # 1  is same as document # 0
Common expressions are:
     is quite an easy concept ( 1 times )
     copy and paste function ( 1 times )
     copy and paste ( 2 times )

5.0  % of document # 0  is same as document # 2
53.0  % of document # 2  is same as document # 0
Common expressions are:
     it is something as simple as stealing and lying ( 1 times )
     it is a crime you ( 1 times )

To summarize, the whole process works as follows:

1) run the comparison function to identify the expressions (sequence of words) that are common to each pair of documents.

The compareDocs function does this in one call given a list of document texts.
If you use a different comparison algorithm, it may be designed to only perform the comparison between two documents or, in the case of classifiers, it may simply return a list of word/term frequencies for one document.
depending on the algorithm's input and output, you will need to wrap the logic in more or less of your own code to obtain the desired result
What you should be looking for at this stage is a list of common expressions (word sequences) between distinct pairs of documents.
If you are working with an algorithm that only extracts term frequencies (e.g. td-idf), you will have a high complexity problem on your hand to cross-match the term frequencies between document pairs.

For example, a classifier may return frequencies: "cut"=25 times, "and"=97 times "paste"=31 times for a given document. This gives you no indication that the expression "cut and paste" is actually present in the document or how many times it would be there. The document could be talking about tooth paste and never have these 3 words in sequence. Comparing documents based only on word frequencies will find a high correlation between essays on the same topic but this doesn't mean there was plagiarism.

Also, even if your classifier manages to return all expressions of two words or more, each documents will produce close to w*2^n expressions where w is the number of words in the document and n is the maximum length of expressions in number of words (a maximum that you will have to decide). This will easily reach millions per documents and then you will need to match them with the millions in other documents. This may not be an issue if you have Google's resources but it would be for the rest of us.

2) To measure the % of similarity between documents, you will need to locate the common expressions on both sides and measure how many of each document's words are covered by common expressions.

Identifying the location of expressions is a simple text search process
You will, however need to avoid counting any given word more than once because the denominator of your percentage is the number of word in the document (and you don't want to overestimate or go over 100%)
This can be achieved by processing longer expressions first and removing them from the text (or masking them) so that their words are not counted again by subsequent (shorter) expressions
The analyzeComparison() function masks the expressions that are found in the text by replacing them with a placeholder that will later be used to re-inject the text with highlighting tags (HTML).

3) Make use of the document comparison analysis in your own program. This depends on how you want to present the information and whether you need to store the results or not (up to you). You could, for example, decide on a threshold of similarity and only output document pairs that are suspicious. This threshold could be based on the percentages, number of common expressions, maximum or average length of common expressions, etc.

[EDIT2] How compareDocs works ...

The function creates a dictionary of expressions that maps them to the position of their first word in each document. This is stored in the expressionRanges variable.
- example: { "copy and paste":[ (0,57), (1,7),(1,32) ] .... }
- This means that the 3 word expression "copy and paste" is found in document #0 at position 57 (position of the word "copy") and in document #1 at positions 7 and 32.
The expression dictionary (expressionRanges) starts out with one-word expressions and uses this to get to 2-word expressions, then 3-word, and so on.
Before moving on to the next expression size, the expression dictionary is cleaned up by removing all expressions that are only found in one document.
- size 1 ==> { "copy":[ (0,57),(0,72),(1,7),(1,32),(1,92) ] ... }
- clean-up ...
- size 2 ==> { "copy and": [ (0,57),(1,7),(1,32),(1,92) ] ... }
- clean-up ...
- size 3 ==> { "copy and paste": [ (0,57),(1,7),(1,32) ] ... }
This clean-up is achieved by creating a separate dictionary (expressionDocs) that maps expressions to a set of document indexes that contain the expression. Expressions that end up with only one document in their set are removed from both dictionaries.
The expressionDocs dictionary is also used to produce the output of the function. Expressions that appear in more than one document are mapped to document pairs (combinations of 2) to form a dictionary containing: { (document pair):[list of expressions] } (which is the result of the function)
The tallyDuplicates sub-function performs the transposition from {Expression:[list of document indexes]} to {(document pair):[list of expressions]} by adding the expression to every combination of 2 in the list of document indexes.

The successive refinement of expressionRanges considerably reduces the number of word matches to perform. Each pass merely adds one word to each expression and is immediately cleaned up before going to the next expression size. The expressionRanges dictionary starts out with as many entries as there are distinct words in the documents but is rapidly reduced to a much smaller size (unless the documents are practically identical).

One drawback of this approach is that documents that have a large number of very long matching expressions will cause the dictionary to grow instead of shrinking and the while loop will run much longer. The worst case scenario would be two identical documents. This could be avoided by introducing a maximum expression size to make the loop stop earlier. If, for example, you set a maximum size of 25, the function will only report a 25 word common expression and a 5 word common expression instead of a 30 word expression if there is one. This may be an acceptable compromise to avoid the very long processing times that nearly identical documents will entail. As far as the percentage of similitude is concerned, the difference would be minimal. (i.e. one common word may be ignored if there is a 26 word common expression and the maximum is 25 but a 27-word expression would come out as a 25word and a 2word match)