Tags: java, algorithm, groovy, detection, copy-paste

Detecting copied or similar text blocks


I have a collection of texts about programming in Markdown format. A build process converts them to Word/HTML and also runs simple validation checks, such as spell checking and verifying that each document has the required header structure. I would like to extend that build code to also detect copy-pasted or similar chunks across all the texts.

Is there any existing Java/Groovy library that can help me with that analysis?

My first idea was to use PMD's CopyPasteDetector (CPD), but it is geared toward analysing real source code, and I don't see how to apply it to ordinary text.


Solution

  • I ended up using CPD and Groovy after all. Here is the code in case anyone is interested:

    import net.sourceforge.pmd.cpd.Tokens
    import net.sourceforge.pmd.cpd.TokenEntry
    import net.sourceforge.pmd.cpd.Tokenizer
    import net.sourceforge.pmd.cpd.CPDNullListener
    import net.sourceforge.pmd.cpd.MatchAlgorithm
    import net.sourceforge.pmd.cpd.SourceCode
    import net.sourceforge.pmd.cpd.SourceCode.StringCodeLoader
    
    // Prepare empty token data.
    TokenEntry.clearImages()
    def tokens = new Tokens()
    
    // List all source files with text.
    def source = new TreeMap<String, SourceCode>()
    new File('.').eachFile { file ->
      if (file.isFile() && file.name.endsWith('.txt')) {
        def analyzedText = file.text
        def sourceCode = new SourceCode(new StringCodeLoader(analyzedText, file.name))
        source.put(sourceCode.fileName, sourceCode)
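        // eachLine passes a zero-based line index as the second argument.
        analyzedText.eachLine { line, lineNumber ->
          // \W already covers whitespace, so one class splits on everything that is not a word character.
          line.split('\\W+').each { token ->
            if (token) {
              tokens.add(new TokenEntry(token, sourceCode.fileName, lineNumber + 1))
            }
          }
        }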
        tokens.add(TokenEntry.getEOF())
      }
    }
    
    // Run matching algorithm.
    // MatchAlgorithm takes the minimum number of consecutive tokens that counts as a duplicate.
    def minTokenChain = 15
    def matchAlgorithm = new MatchAlgorithm(source, tokens, minTokenChain, new CPDNullListener())
    matchAlgorithm.findMatches()
    
    // Produce report.
    matchAlgorithm.matches().each { match ->
      println "  ========================================"
      match.iterator().each { mark ->
        println "  DUPLICATION ERROR: <${mark.tokenSrcID}:${mark.beginLine}> [DUPLICATION] Found a ${match.lineCount} line (${match.tokenCount} tokens) duplication!"
      }
      def indentedTextSlice = ""
      match.sourceCodeSlice.eachLine { line ->
        indentedTextSlice += "  $line\n"
      }
      println "  ----------------------------------------"
      println indentedTextSlice
      println "  ========================================"
    }
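
For anyone who would rather not depend on CPD's classes, the underlying idea (split text into word tokens, then look for runs of consecutive tokens that occur more than once) can be sketched in plain Java. This is only a rough illustration and not how CPD works internally: the names `DuplicateTextFinder`, `Token`, `tokenize`, `findDuplicates` and `windowSize` are made up for this example, tokens are lower-cased (an extra assumption on my part), and overlapping windows are not merged into one longest match the way CPD reports them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateTextFinder {

    /** One word plus the file and 1-based line it came from (hypothetical helper type). */
    public static final class Token {
        final String text;
        final String file;
        final int line;

        Token(String text, String file, int line) {
            this.text = text;
            this.file = file;
            this.line = line;
        }
    }

    /** Splits text into word tokens on non-word characters, remembering line numbers. */
    public static List<Token> tokenize(String file, String text) {
        List<Token> tokens = new ArrayList<>();
        String[] lines = text.split("\\R", -1);
        for (int i = 0; i < lines.length; i++) {
            for (String word : lines[i].split("\\W+")) {
                if (!word.isEmpty()) {
                    // Lower-casing is an assumption: it lets "The" match "the".
                    tokens.add(new Token(word.toLowerCase(), file, i + 1));
                }
            }
        }
        return tokens;
    }

    /**
     * Maps every run of windowSize consecutive words that occurs more than once
     * to the "file:line" positions where each occurrence starts.
     */
    public static Map<String, List<String>> findDuplicates(Map<String, String> texts, int windowSize) {
        Map<String, List<String>> seen = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : texts.entrySet()) {
            List<Token> tokens = tokenize(entry.getKey(), entry.getValue());
            // Slide a fixed-size window over the tokens of one file at a time.
            for (int start = 0; start + windowSize <= tokens.size(); start++) {
                StringBuilder key = new StringBuilder();
                for (int j = start; j < start + windowSize; j++) {
                    if (j > start) {
                        key.append(' ');
                    }
                    key.append(tokens.get(j).text);
                }
                Token first = tokens.get(start);
                seen.computeIfAbsent(key.toString(), k -> new ArrayList<>())
                    .add(first.file + ":" + first.line);
            }
        }
        // Keep only word runs that were seen at least twice.
        seen.values().removeIf(positions -> positions.size() < 2);
        return seen;
    }

    public static void main(String[] args) {
        Map<String, String> texts = new LinkedHashMap<>();
        texts.put("a.txt", "Intro line\nthe quick brown fox jumps over the lazy dog");
        texts.put("b.txt", "Yesterday the quick brown fox\njumps over a fence");
        findDuplicates(texts, 4).forEach((words, positions) ->
            System.out.println("\"" + words + "\" at " + positions));
    }
}
```

With a window of 4, the two sample texts share the six-word run "the quick brown fox jumps over", which shows up as three overlapping four-word windows, each starting at `a.txt:2` and `b.txt:1`. For real documents a larger window (the Groovy script above uses 15 tokens) avoids flagging common short phrases.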