I have a large number of PDF files in my local filesystem that I use as a documentation base, and I would like to create an index of these files. I would like to:

[…]

My questions are:

[…] Perl, swish-e, pdfgrep to make a script. Do you know of other tools which could be useful?

Given that points 2 and 3 seem custom, I'd recommend writing your own script: use an existing tool to parse the PDF, process its output as you please, and write out HTML (perhaps using another tool).
Perl is well suited for this, since it excels at the kind of text processing you'll need and, via modules, also provides support for working with all kinds of file formats.
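For the outer loop over your documentation tree, the core File::Find module is enough. A minimal sketch, where the directory name is a placeholder for your own documentation base:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    my $doc_root = '/path/to/docs';    # placeholder: your documentation base

    my @pdf_files;
    find(
        sub {
            # $File::Find::name is the full path of the file being visited
            push @pdf_files, $File::Find::name if -f and /\.pdf\z/i;
        },
        $doc_root,
    );

    printf "Found %d PDF files\n", scalar @pdf_files;
    # then: extract text, build statistics, write the HTML index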
As for reading the PDFs, here are some options, if your needs aren't too elaborate (a short sketch of the first two follows the list):
- Use the CAM::PDF (and CAM::PDF::PageText) or PDF-API2 modules
- Use pdftotext from the poppler library (probably in the poppler-utils package)
- Use pdftohtml with the -xml option, and read the generated simple XML file with XML::LibXML or XML::Twig
The last two are external tools, which you run via Perl's builtins like system.
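Here is a rough sketch of the first two options, each written as a helper that takes one PDF path and returns its text. The helper names are just for illustration and error handling is minimal; the second one reads pdftotext's output through a piped open instead of system plus a temporary file:

    use strict;
    use warnings;
    use CAM::PDF;
    use CAM::PDF::PageText;

    # Option 1: pure Perl, via CAM::PDF
    sub text_via_campdf {
        my ($file) = @_;
        my $pdf = CAM::PDF->new($file) or die "Can't parse $file\n";
        my $text = '';
        for my $page (1 .. $pdf->numPages) {
            my $tree = $pdf->getPageContentTree($page);
            $text .= CAM::PDF::PageText->render($tree);
        }
        return $text;
    }

    # Option 2: pdftotext from poppler-utils; '-' sends its output to stdout
    sub text_via_pdftotext {
        my ($file) = @_;
        open my $out, '-|', 'pdftotext', $file, '-'
            or die "Can't run pdftotext: $!";
        local $/;                       # slurp mode
        my $text = <$out>;
        close $out or warn "pdftotext reported a problem on $file\n";
        return $text;
    }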
The text processing that follows, to build your summary and design the output, is precisely what languages like Perl are for. The couple of tasks mentioned take a few lines of code.
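For instance, a crude keyword count over the extracted text could look like the following; the stop-word list and the word-length cutoff are arbitrary placeholders to adjust to your documents:

    use strict;
    use warnings;

    my %stop = map { $_ => 1 }
        qw(the a an and or of to in is are for on with that this);

    # $text would come from one of the extraction helpers above
    sub top_keywords {
        my ($text, $n) = @_;
        my %count;
        for my $word ($text =~ /(\w{3,})/g) {    # words of 3+ characters
            $word = lc $word;
            $count{$word}++ unless $stop{$word};
        }
        my @sorted = sort { $count{$b} <=> $count{$a} } keys %count;
        $n = @sorted if $n > @sorted;
        return @sorted[0 .. $n - 1];
    }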
Then write out the HTML, either directly if it's simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
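A minimal HTML::Template sketch, assuming a template file index.tmpl next to the script; the field names and the sample entry are made up for the example. index.tmpl might contain:

    <html><body><h1>PDF index</h1><ul>
    <TMPL_LOOP NAME="FILES">
      <li><a href="<TMPL_VAR NAME=PATH>"><TMPL_VAR NAME=NAME></a>
          - <TMPL_VAR NAME=KEYWORDS></li>
    </TMPL_LOOP>
    </ul></body></html>

and the script side would be roughly:

    use strict;
    use warnings;
    use HTML::Template;

    my $template = HTML::Template->new(filename => 'index.tmpl');

    # one hashref per PDF; in the real script these come from the earlier steps
    my @entries = (
        { PATH => 'docs/foo.pdf', NAME => 'foo.pdf', KEYWORDS => 'alpha, beta' },
    );
    $template->param(FILES => \@entries);

    open my $out, '>', 'index.html' or die "Can't write index.html: $!";
    print {$out} $template->output;
    close $out;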
Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.
If your process for selecting keywords and building statistics is fairly common, there are integrated document-management tools you could use instead (search for bibliography managers). However, I think most of them resort to external tools to parse the PDF anyway, so you may still be better off with your own script.