How to determine an overlapping sequence of words between two texts


In one of our digital assignments, I asked my students to read an article and write a few things they learned from it. They were told to write in their own words, and I had reason to believe that copying and pasting text from the article was disabled. I was wrong. I received over 9000 entries, many of which looked like they were copied and pasted directly from the assignment. Some differed in punctuation and capitalization, but I cannot imagine that the students literally sat there and typed out most of the article.

I have read through many of the students' submissions and tried to identify features that distinguish a copied-and-pasted entry from an honest one, hoping that some R function could help me detect them. However, I have not been successful. To demonstrate, here is an example that I made up. The passages are often long, between 300 and 800 words, and I wonder if there's a relatively easy way to identify the common block of words that overlaps between the two texts.

text_1 <- "She grew up in the United States. Her father was..."
text_2 <- "I learned that she grew up in the united states.Her father was ..."

Desired Outcome: "she grew up in the united states. Her father was ..."

The desired outcome should print the sequence of words that overlap between the two texts; differences in capitalization or spacing do not matter.

Thank you for reading and for any expertise you can share.


Solution

  • This is not quite what you asked for, but you can use the {stringdist} package to evaluate the "distance" between two texts, generally interpreted as the number of characters you would have to modify in a string to make it equal to the reference string. So "friend" and "friendly" would have a distance of 2.

    This way you could check which texts have fewer differences from the reference text, possibly meaning that they were copied straight from the source material.

    # https://github.com/markvanderloo/stringdist
    install.packages('stringdist')
    
    library(stringdist)
    
    base_text <- "she grew up in the united states.Her father was"
    
    text_1 <- "She grew up in the United States. Her father was"
    text_2 <- "I learned that she grew up in the united states.Her father was"
    text_3 <- "The main character was born in the USA, his father being"
    text_4 <- "My favourite animals are raccoons, they are so silly and cute"
    text_5 <- "I didn't understand this assignment so I'm just answering gibberish"
    text_6 <- "she grew up in the united states.Her father was"
    
    test_texts <- c(text_1, text_2, text_3, text_4, text_5, text_6)
    
    # calculate string distance using default method
    distances <- stringdist(base_text, test_texts)
    
    # texts that are at most 25 edits away from the original text
    possible_copied_texts <- test_texts[distances <= 25]
    
    possible_copied_texts
    
    #[1] "She grew up in the United States. Her father was"              
    #[2] "I learned that she grew up in the united states.Her father was"
    #[3] "she grew up in the united states.Her father was"        
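
    Because capitalization and spacing differences should not count, it may help to normalize the texts before measuring distance, and to divide the distance by the text length so that one threshold can serve passages of different lengths. The sketch below is an illustration under those assumptions and is not part of the reprex above: it uses base R's `adist()` (generalized Levenshtein distance) in place of `stringdist()` so that it runs without extra packages, and `normalize_text()` and `relative_distance()` are hypothetical helper names.

    ```r
    # Hypothetical helper: lower-case, turn punctuation into spaces,
    # collapse repeated whitespace, and trim the ends
    normalize_text <- function(x) {
      x <- tolower(x)
      x <- gsub("[[:punct:]]+", " ", x)
      x <- gsub("[[:space:]]+", " ", x)
      trimws(x)
    }

    # Hypothetical helper: Levenshtein distance divided by the length of the
    # longer text, so 0 means identical and the value never exceeds 1
    relative_distance <- function(reference, candidates) {
      ref  <- normalize_text(reference)
      cand <- normalize_text(candidates)
      d <- as.vector(utils::adist(ref, cand))
      d / pmax(nchar(ref), nchar(cand), 1)
    }

    base_text <- "she grew up in the united states.Her father was"
    candidates <- c(
      "She grew up in the United States. Her father was",
      "My favourite animals are raccoons, they are so silly and cute"
    )

    # the first value is 0: after normalization the two texts are identical
    relative_distance(base_text, candidates)
    ```

    A cut-off on this relative value (for example, flagging anything below some small fraction) would have to be tuned on real submissions; the number is a judgment call, not something the method provides.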
    

    If this method does not work for your use case, you can use stringdist with the longest common substring method (method = 'lcs'), which is defined as the "longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact." Note that, by this definition, the paired characters do not have to be adjacent. This way we can check whether longer texts have a pasted text inside them, even if it is slightly modified:

    library(stringdist)
    
    base_text_2 <- "this sentence means plagiarism therefore something bad will occur"
    
    text_7 <- "random string with no words from the base text"
    text_8 <- "cat dog pig chicken duck this sentence means plagiarism therefore something bad will occur food pizza sandwich pineapple"
    text_9 <- "this pretty long sentence does in fact mean that I have not plagiarized any text, instead I'm writing all by myself"
    text_10 <- "what so you mean you are doubting this text? you can't actually think that this sentence means plagiarism therefore something bad will occur, even if it does use words straight from the plagiarism sentence"
    text_11 <- "totally normal text"
    text_12 <- "this sentence means plagiarism therefore something bad will occur"
    text_13 <- "this sentence does not mean plagiarism and therefore something bad not will occur"
    # here, strings 8, 10, and 12 contain the base text in them, and string 13 contains a slightly modified version of the base text which would still be plagiarism
    
    # create a vector with the strings
    test_texts_2 <- c(text_7, 
                      text_8, 
                      text_9, 
                      text_10,
                      text_11,
                      text_12,
                      text_13)
    
    # but we will also add filler text before and after every string, so that they become longer
    filler <- "lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt"
    test_texts_3 <- paste(filler, test_texts_2, filler)
    
    # perform the string distance calculation with the longest common substring method
    distances_lcs <- stringdist(base_text_2, test_texts_3, method = "lcs")
    
    # subtract each distance from the length of its string, then subtract the length of the base text, so that strings containing the base text score zero
    distance_lcs_results <- nchar(test_texts_3) - distances_lcs - nchar(base_text_2)
    
    # a value of 0 means all of the base text was matched inside the string
    distance_lcs_results
    #> [1] -38   0 -24   0 -44   0  -2
    
    # subset the vector so that we can confirm that the strings that contain the text were detected
    test_texts_2[distance_lcs_results == 0]
    #> [1] "cat dog pig chicken duck this sentence means plagiarism therefore something bad will occur food pizza sandwich pineapple"                                                                                     
    #> [2] "what so you mean you are doubting this text? you can't actually think that this sentence means plagiarism therefore something bad will occur, even if it does use words straight from the plagiarism sentence"
    #> [3] "this sentence means plagiarism therefore something bad will occur"
    
    # but we can also get close matches: strings containing text that is similar, but not identical, to the base text
    test_texts_2[abs(distance_lcs_results) < 20]
    #> [1] "cat dog pig chicken duck this sentence means plagiarism therefore something bad will occur food pizza sandwich pineapple"                                                                                     
    #> [2] "what so you mean you are doubting this text? you can't actually think that this sentence means plagiarism therefore something bad will occur, even if it does use words straight from the plagiarism sentence"
    #> [3] "this sentence means plagiarism therefore something bad will occur"                                                                                                                                            
    #> [4] "this sentence does not mean plagiarism and therefore something bad not will occur"
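
    To see why the subtraction above yields 0 for a contained text, assume (as the quoted definition suggests) that the lcs distance counts the characters left unpaired in both strings. For a base text `a` and a test string `b` with `L` paired characters, the distance is then `nchar(a) + nchar(b) - 2 * L`, so the score reduces to `2 * (L - nchar(a))`. The base R sketch below checks that arithmetic on tiny strings; `lcs_length()` and `lcs_score()` are hypothetical helpers written for illustration, not stringdist internals.

    ```r
    # Hypothetical helper: number of characters that can be paired in order
    # between a and b (classic dynamic programming, one row at a time)
    lcs_length <- function(a, b) {
      x <- strsplit(a, "")[[1]]
      y <- strsplit(b, "")[[1]]
      prev <- integer(length(y) + 1L)
      for (i in seq_along(x)) {
        cur <- integer(length(y) + 1L)
        for (j in seq_along(y)) {
          cur[j + 1L] <- if (x[i] == y[j]) prev[j] + 1L else max(prev[j + 1L], cur[j])
        }
        prev <- cur
      }
      prev[length(y) + 1L]
    }

    # Hypothetical helper: the same score as in the answer, rebuilt from L
    lcs_score <- function(base, text) {
      L <- lcs_length(base, text)
      distance <- nchar(base) + nchar(text) - 2L * L  # unpaired characters
      nchar(text) - distance - nchar(base)            # equals 2 * (L - nchar(base))
    }

    lcs_score("abc", "xxabcxx")  # 0: "abc" is contained
    lcs_score("abc", "xaxbxcx")  # 0 as well: a, b, c appear in order but not adjacently
    lcs_score("abc", "xxabxx")   # -2: one character of "abc" stays unpaired
    ```

    The second call shows a caveat: a score of 0 only says that every character of the base text appears in order, so very long texts can reach 0 without containing the passage verbatim.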
    

    You could use both methods (or more!) to create a score variable, and then make a decision based on multiple plagiarism metrics.

    Created on 2024-07-24 with reprex v2.1.0
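
    To print the overlapping words themselves, which is what the question asks for, one option is to compare the texts word by word after normalization and keep the longest run of consecutive shared words. The following is a minimal base R sketch under that assumption and was not part of the reprex above; `normalize_text()` and `longest_common_words()` are hypothetical helper names, and the function handles one pair of texts at a time.

    ```r
    # Hypothetical helper: lower-case, turn punctuation into spaces,
    # collapse repeated whitespace, and trim the ends
    normalize_text <- function(x) {
      x <- tolower(x)
      x <- gsub("[[:punct:]]+", " ", x)
      x <- gsub("[[:space:]]+", " ", x)
      trimws(x)
    }

    # Hypothetical helper: longest run of consecutive words shared by a and b
    longest_common_words <- function(a, b) {
      x <- strsplit(normalize_text(a), " ", fixed = TRUE)[[1]]
      y <- strsplit(normalize_text(b), " ", fixed = TRUE)[[1]]
      best_len <- 0L
      best_end <- 0L
      prev <- integer(length(y) + 1L)
      for (i in seq_along(x)) {
        # cur[j + 1] = length of the common run ending at x[i] and y[j]
        cur <- c(0L, ifelse(y == x[i], prev[seq_along(y)] + 1L, 0L))
        if (max(cur) > best_len) {
          best_len <- max(cur)
          best_end <- i
        }
        prev <- cur
      }
      if (best_len == 0L) return("")
      paste(x[(best_end - best_len + 1L):best_end], collapse = " ")
    }

    text_1 <- "She grew up in the United States. Her father was"
    text_2 <- "I learned that she grew up in the united states.Her father was"

    longest_common_words(text_1, text_2)
    #> [1] "she grew up in the united states her father was"
    ```

    Punctuation is dropped by the normalization step, so the result is a sequence of words rather than the original sentence. Counting the words in the result and comparing that count with the length of the submission could serve as one more metric alongside the distances above.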