python urllib plagiarism-detection

How to search for occurrence of a word/phrase within webpage?


My end goal here is to create a primitive plagiarism checker given a text file. I plan to do this by first splitting the data by sentence, searching each sentence on Google, and finally searching each of the first few URLs returned by Google for occurrences of the sentence/substrings. This last step is the one I'm having trouble with.

When running through each URL in a for-loop, I first read the contents of the URL using urllib.request.urlopen(), but I'm not sure what to do after that. Code is attached below, with some solutions I've tried commented out. I've imported the googlesearch, urllib.request, and re libraries.

from re import findall
from urllib.request import urlopen

from googlesearch import search


def plagCheck():

    global inpFile

    with open(inpFile) as data:
        sentences = data.read().split(".")

    for sentence in sentences:
        for url in search(sentence, tld='com', lang='en', num=5, start=0, stop=5, pause=2.0):
            content = urlopen(url).read()

            # if sentence in content:
            #     print("yes")
            # else:
            #     print("no")

            # matches = findall(sentence, content)
            # if len(matches) == 0:
            #     print("no")
            # else:
            #     print("yes")

           


Solution

  • If I understand your code correctly, you now have two Python lists of sentences. It looks like you split them on periods only. Sentences that end in other punctuation (?, !) would then be merged into fairly large run-on sentences.

    I would consider using a similarity checker library. difflib has a SequenceMatcher class for this. Then decide on a percentage above which a sentence gets flagged, e.g. 40% similar or more. This reduces the amount of content you have to check manually.
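
A minimal sketch of that comparison, assuming difflib.SequenceMatcher is the similarity checker and that content is the bytes object returned by urlopen(url).read(). The bytes need decoding first: testing a str against bytes with `in`, or passing a str pattern and bytes to re.findall, raises a TypeError. The names page_text and best_match, and the UTF-8 default, are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def page_text(content, encoding="utf-8"):
    """Decode the bytes from urlopen(url).read() and strip the markup."""
    parser = _TextExtractor()
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    parser.feed(content.decode(encoding, errors="replace"))
    parser.close()
    # Collapse runs of whitespace so line breaks in the HTML don't matter.
    return " ".join(" ".join(parser.parts).split())


def best_match(sentence, text):
    """Return (ratio, page_sentence) for the closest sentence in text."""
    best = (0.0, "")
    target = sentence.strip().lower()
    for candidate in re.split(r"[.!?]+", text):
        candidate = candidate.strip()
        if not candidate:
            continue
        ratio = SequenceMatcher(None, target, candidate.lower()).ratio()
        if ratio > best[0]:
            best = (ratio, candidate)
    return best
```

Inside the loop this could be called as `ratio, found = best_match(sentence, page_text(content))`, flagging the sentence when ratio is at or above the chosen threshold (0.4 in the example above). Comparing sentence by sentence is slow on long pages, so this is a starting point rather than a tuned solution.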

    I would also expand the set of punctuation marks you split on. That might look something like this:

    with open(inpFile) as data:
        # Treat ! and ? as sentence endings by replacing them with . before splitting
        sentences = data.read().replace("!", ".").replace("?", ".").split(".")
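
The same split can also be done in one pass with a regular expression. This is a sketch only; split_sentences is a hypothetical helper name, and it will still split inside abbreviations such as "e.g." or decimal numbers.

```python
import re


def split_sentences(text):
    """Split text on runs of '.', '!' or '?' and drop empty fragments."""
    pieces = re.split(r"[.!?]+", text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

Dropping the empty fragments matters here, since searching Google for an empty string would waste a request.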
    

    Then I would write the results for this file to a new output file, something like this:

    # Loop over each sentence and run it through Google.
    # Compare those two sentences with the sequence matcher mentioned above (difflib).
    # Add them to a dictionary with the percent, URL, and sentence in question.
    # Sample result
    results = {
        "sentence_num": 0,
        "percent": 0.8,
        "url": "the google url found on",
        "original_sentence": "Red green fox over the wall",
    }
    outputStr = "<html>"
    # Loop over the results and format each dictionary in a way that you can read,
    # ideally as an HTML table with columns representing the keys above.
    outputStr += "<table>"  # etc
    # Open the output file for writing; a separate name avoids shadowing results.
    with open(outputFile, "w") as out:
        out.write(outputStr)

    You could even go as far as highlighting table rows based on the percentage, e.g.:

    80% and above is red, 61-79% is orange, 40-60% is yellow, and 39% and below is green.
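
A rough sketch of that color banding, assuming the results are kept as a list of dictionaries with the keys from the sample result and that percent is a fraction between 0 and 1. The names row_color and render_report are hypothetical, and scores that fall between 60% and 61% are treated as orange here, which is a choice the bands above leave open.

```python
from html import escape


def row_color(percent):
    """Map a similarity score in [0, 1] to the color bands described above."""
    if percent >= 0.80:
        return "red"
    if percent > 0.60:
        return "orange"
    if percent >= 0.40:
        return "yellow"
    return "green"


def render_report(results):
    """Build an HTML table from a list of result dictionaries."""
    rows = [
        "<html><body><table>",
        "<tr><th>#</th><th>Percent</th><th>URL</th><th>Sentence</th></tr>",
    ]
    for r in results:
        rows.append(
            '<tr style="background-color: {}">'
            "<td>{}</td><td>{:.0%}</td><td>{}</td><td>{}</td></tr>".format(
                row_color(r["percent"]),
                r["sentence_num"],
                r["percent"],
                escape(r["url"]),  # escape so page text cannot break the markup
                escape(r["original_sentence"]),
            )
        )
    rows.append("</table></body></html>")
    return "\n".join(rows)
```

The returned string can then be written out with `open(outputFile, "w")` as in the snippet above.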