
Can R detect duplicate sentences in a Word file?


I have a Word document containing 100 pages and I want to detect duplicate sentences. Is there a way to do this automatically in R?

1. Convert to a txt file.
2. Read it in:

     tx <- readLines("C:\\Users\\paper-2013.txt")
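
Aside: if you would rather skip the manual conversion step, a package such as readtext can read a .docx file directly. A sketch (the .docx path is an assumption mirroring the txt path above):

     library(readtext)  # install.packages("readtext") if needed
     # readtext returns a data.frame with a doc_id and a text column
     doc <- readtext("C:\\Users\\paper-2013.docx")  # hypothetical path
     tx <- doc$text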

Solution

  • Here is a small code chunk that I have used previously, loosely based on Matloff's The Art of R Programming, where he uses something similar as an example:

     sent <- "This is a sentence. Here comes another sentence. This is a sentence. This is a sentence. Sentence after sentence. This is two sentences."
    

    You can split the text into sentences at the full stops using strsplit:

     out <- strsplit(sent, ".", fixed=TRUE)  # split at every full stop
     library(gdata)
     out <- trim(out)  # trims leading and trailing white space
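
    If you prefer to avoid the gdata dependency, base R's trimws() (available since R 3.2.0) does the same trimming; and if your text also ends sentences with question or exclamation marks, splitting on a character class is slightly more robust. A sketch:

     out <- strsplit(sent, "[.!?]")  # split at ., ! or ?
     out <- lapply(out, trimws)      # base R white-space trimming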
    

    Now, this may seem clumsy, but bear with me:

     outlist <- list()
     sentences <- out[[1]]  # the vector of split sentences
     for (i in seq_along(sentences)) {
       # record position i under the sentence's own name
       outlist[[sentences[i]]] <- c(outlist[[sentences[i]]], i)
     }
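
    As a less clumsy alternative, the same grouping can be done in one call with split(), which returns the positions grouped by sentence (the entries come back in alphabetical rather than first-occurrence order):

     outlist <- split(seq_along(sentences), sentences)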
    

    Now you have a list in which every entry is named after a sentence and contains the positions where that sentence occurs. You can use length() on each entry to see how often a sentence is repeated. The positions also let you spot direct duplicates, which helps to distinguish writing the same sentence twice by mistake (e.g. "My name is R. My name is R.") from coincidentally repeating a harmless sentence at very different positions in the text (e.g. a sentence like "Here is an example." may legitimately occur several times without being a problem).

     > outlist
     $`This is a sentence`
     [1] 1 3 4
     
     $`Here comes another sentence`
     [1] 2
     
     $`Sentence after sentence`
     [1] 5
     
     $`This is two sentences`
     [1] 6
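
    To act on this list, you can pull out the sentences that occur more than once and flag those that are repeated back to back. A sketch building on outlist from above:

     # keep only sentences that occur more than once
     dups <- Filter(function(pos) length(pos) > 1, outlist)
     
     # of those, flag sentences repeated at directly adjacent positions
     direct <- Filter(function(pos) any(diff(pos) == 1), dups)
     
     names(dups)    # all duplicated sentences
     names(direct)  # sentences duplicated back to back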