Tags: html, wget, mirroring, plagiarism-detection, httrack

How can I mirror the results of MOSS plagiarism detection?


MOSS is a well-known server for checking software plagiarism. It allows teachers to send homework submissions, calculates the similarity between different submissions, and colors code blocks that are very similar. Here is an example of the results of the comparison. As you can see, it is very simple: an index HTML file lists the suspected files and links to a separate HTML file for each comparison.

The results are kept on the MOSS website for two weeks. I would like to download all the results to my computer so that I can view them later. I use this command on Linux:

wget -mkEpnp http://moss.stanford.edu/results/5/7683916027631/index.html

The result is that only the index.html file is downloaded. The other files, which are linked from index.html, e.g. match0.html and match1.html, are not downloaded.

I tried to mirror the same website with a different tool, WebHTTrack, but got exactly the same result: only the index file is mirrored, not the match files.

The HTML looks very simple, so I cannot figure out why the mirroring does not work. What can I do to correctly mirror the results?

P.S. In case it is relevant, the robots.txt file contains the following:

User-agent: *
Disallow: /

Solution

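  • You need to tell wget to ignore the robots.txt file. It disallows all crawlers, and wget honors it during recursive downloads, which is why only the page you name explicitly is fetched. For example: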

    wget -r -l 1 -e robots=off http://moss.stanford.edu/results/1/XXXXXXXXXX/
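
As a rough sketch, the same fix can be combined with the mirroring options from the question (the long forms of -m, -k, -E, -p and -np). The helper name mirror_moss, the DRY_RUN switch, the default destination directory moss-results, and the one-second --wait are assumptions added here for illustration; they are not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper: mirror one MOSS results directory while ignoring robots.txt.
# Usage:   mirror_moss URL [DEST_DIR]
# Example: mirror_moss http://moss.stanford.edu/results/1/XXXXXXXXXX/ moss-results
# Set DRY_RUN=1 to print the wget command instead of running it.
mirror_moss() {
    url=$1
    dest=${2:-moss-results}   # assumed default output directory

    if [ -z "$url" ]; then
        echo "usage: mirror_moss URL [DEST_DIR]" >&2
        return 2
    fi

    # --mirror / --no-parent: recurse, but stay inside the results directory
    # --convert-links, --adjust-extension, --page-requisites: make the copy browsable offline
    # --execute robots=off: do not honor robots.txt (the fix from the answer)
    # --wait=1: pause between requests (an added courtesy, not from the answer)
    ${DRY_RUN:+echo} wget --mirror --no-parent \
        --convert-links --adjust-extension --page-requisites \
        --execute robots=off --wait=1 \
        --directory-prefix="$dest" "$url"
}
```

Running it once with DRY_RUN=1 shows the exact command before anything is downloaded.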