latex, plagiarism-detection

How do I extract significant text content from a LaTeX document?


I need to extract text-only content from my thesis document written in LaTeX for an automated anti-plagiarism check. I know only about the "draft" option and it's not enough.

I am supposed to omit:

  • images,
  • tables and other figures,
  • equations,
  • captions and footnotes.

It'd also be nice to remove all the references. The output should be a plain (UTF-8 encoded) text file.

Is there any straightforward way to do this? I don't really fancy copying it manually page-by-page.


Solution

  • You could try using the comment package (or one of a dozen alternatives) to turn equation, figure, table, etc. into comment environments, and \renewcommand\footnote[1]{} to remove footnotes. \pagestyle{empty} should remove page headings and page numbers, so running pdftotext on the resulting PDF should come close to what you want.
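As a rough illustration of the suggestion above, the preamble additions might look like the sketch below. It assumes the thesis uses the standard figure, table and equation environments; any other environments (align, longtable, starred variants and so on) would need their own lines, and whether those behave well with the comment package is something to test rather than assume.

```latex
% Preamble additions for a text-only build (a sketch; adjust to your thesis).
\usepackage{comment}

% Turn floats and displayed equations into comment environments.
% Captions inside figure/table are discarded along with their contents.
\excludecomment{figure}
\excludecomment{table}
\excludecomment{equation}

% Swallow the mandatory argument of every footnote.
\renewcommand\footnote[1]{}

% No running heads or page numbers.
\pagestyle{empty}
```

The comment package scans the source line by line, so each \end{figure}, \end{table} and \end{equation} should sit on a line of its own, and the excluded environments should not be nested inside the argument of another command. Inline math ($...$) is not affected by this approach and will still appear in the output. After compiling, something like pdftotext -enc UTF-8 thesis.pdf thesis.txt (thesis.pdf being a placeholder file name) should give a UTF-8 text file.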