
Index PDF files and generate keywords summary


I have a large number of PDF files on my local filesystem that I use as a documentation base, and I would like to create an index of these files. I would like to:

  1. Parse the contents of the PDF files to get keywords.
  2. Select the most relevant keywords to make a summary.
  3. Create static HTML pages for some keywords with entries linked to the appropriate files.

My questions are:

  • Is there an existing tool that performs the whole job?
  • What is the most appropriate tool to parse the content of PDF files, filter the words (by length), and count them?
  • I am considering using Perl, swish-e, and pdfgrep to write a script. Do you know of other tools that could be useful?

Solution

  • Given that points 2 and 3 seem custom, I'd recommend writing your own script: call a tool from within it to parse the PDF, process its output as you please, and write the HTML (perhaps using another tool).

    Perl is well suited for that, since it excels at the kind of processing you'll need and also supports working with all kinds of file formats via modules.

    As for reading PDF files, here are some options if your needs aren't too elaborate

    The last two are external tools which you use via Perl's builtins like system.
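
    As a rough sketch of the external-tool route, the snippet below assumes that the pdftotext command-line tool (shipped with Poppler and Xpdf) is installed; which tools the answer originally had in mind is not stated here, so treat that choice as an assumption. The helper name pdf_to_text and its optional command override are made up for illustration. It uses a list-form pipe open rather than system so that the output can be captured and the file name never passes through a shell.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: run an external converter and return what it prints.
    # Assumption: `pdftotext` is on PATH; "-" as the output name sends text to STDOUT.
    # An alternative command (one that also accepts "<file> -") may be passed in.
    sub pdf_to_text {
        my ($file, @cmd) = @_;
        @cmd = ('pdftotext', '-layout') unless @cmd;

        # List-form pipe open: no shell is involved, so odd file names are safe.
        open(my $fh, '-|', @cmd, $file, '-')
            or die "Cannot run $cmd[0]: $!";
        local $/;                       # slurp the whole output at once
        my $text = <$fh>;
        close($fh) or die "$cmd[0] failed on $file (exit status $?)";
        return $text // '';
    }

    # Usage: perl extract.pl some_document.pdf
    print pdf_to_text($ARGV[0]) if @ARGV;
    ```

    The extracted text comes back as one string, ready for the keyword processing described next.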

    The text processing that follows, to build your summary and design the output, is precisely what languages like Perl are for. The couple of tasks that you mention (filtering by word size and counting words) take a few lines of code.
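
    To illustrate how little code the filtering and counting takes, here is a minimal sketch. The function names count_words and top_keywords, the minimum length of 4, and the sample text are all illustrative assumptions; it also assumes the text has already been extracted and decoded, and it does no stop-word removal or stemming.

    ```perl
    use strict;
    use warnings;

    # Count words of at least $min_len letters, ignoring case.
    sub count_words {
        my ($text, $min_len) = @_;
        my %count;
        for my $word (lc($text) =~ /\p{L}+/g) {
            next if length($word) < $min_len;
            $count{$word}++;
        }
        return \%count;
    }

    # Return the $n most frequent words; ties are broken alphabetically.
    sub top_keywords {
        my ($count, $n) = @_;
        my @sorted = sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
        $n = @sorted if $n > @sorted;
        return @sorted[0 .. $n - 1];
    }

    my $text  = 'Perl parses PDF text. Perl counts words; words become keywords.';
    my $count = count_words($text, 4);
    print join(', ', top_keywords($count, 3)), "\n";   # prints "perl, words, become"
    ```

    Keeping the counts in a hash per file makes it easy to merge them later into a keyword-to-files index.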

    Then write out HTML, either directly if simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
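
    For the simple case of writing HTML directly, a sketch along these lines may be enough. It deliberately uses plain string building rather than HTML::Template so that it runs with core Perl only; the function names, the page layout, and the sample index data are assumptions for illustration, and the file paths are used as-is for the links (no URL encoding).

    ```perl
    use strict;
    use warnings;

    # Escape the characters that are special in HTML text and attribute values.
    sub escape_html {
        my ($s) = @_;
        my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
        $s =~ s/([&<>"])/$ent{$1}/g;
        return $s;
    }

    # Build one static page for a keyword, linking to each file that contains it.
    sub keyword_page {
        my ($keyword, @files) = @_;
        my $title = escape_html($keyword);
        my $items = join '', map {
            my $f = escape_html($_);
            qq{  <li><a href="$f">$f</a></li>\n};
        } @files;
        return "<!DOCTYPE html>\n<html>\n<head><title>$title</title></head>\n"
             . "<body>\n<h1>$title</h1>\n<ul>\n$items</ul>\n</body>\n</html>\n";
    }

    # Hypothetical index data: keyword => files in which it was found.
    my %index = (perl => ['docs/intro.pdf', 'docs/regex.pdf']);
    for my $kw (sort keys %index) {
        print keyword_page($kw, @{ $index{$kw} });
    }
    ```

    In a real script each page would be written to its own file (for example one per keyword) instead of STDOUT; once the layout grows beyond a list of links, a template module becomes the more maintainable choice.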

    Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.

    If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think that most of them resort to external tools to parse PDF files, so you may still be better off with your own script.