excel, search, adobe-reader

Efficient Way of Recording Page Numbers from a Search of a PDF


I have a list of about 1,200 queries (part numbers) that are specified somewhere inside a 100-page PDF. What I need to do is record which pages each query appears on in the PDF. I can't think of a clever way of doing this. Doing it search by search would take me 5-20 hours, so if someone can give me a good idea before the 5-hour mark, that would be great!


Solution

  • Assuming you can identify a "query" programmatically in the plain text (for example, with a regular expression):

    You could split your PDF into separate files (one file per page) using pdftk and its burst operation:

    http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

    Then convert those files to text with a PDF-to-text utility such as this one:

    http://www.fileguru.com/PDF-To-TXT-Converter/download

    or this one

    http://www.pdf2text.com/

    Finally, write a simple script in your favorite programming language to determine which of those per-page files contain each "query" (whatever that looks like).
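
As a rough sketch of that last step, here is one way the script could look in Python. It rests on several assumptions that are not in the original answer: the per-page text files sit in one directory and are named something like pg_0007.txt (the first run of digits in the name is taken as the page number), the part numbers are matched as plain case-sensitive substrings, and the helper names `load_pages` and `find_pages` are made up for illustration.

```python
import re
from pathlib import Path


def load_pages(page_dir):
    """Read every *.txt file in page_dir into a {page_number: text} dict.

    Assumption: files are named like pg_0007.txt, and the first run of
    digits in the file name is the page number.
    """
    pages = {}
    for path in Path(page_dir).glob("*.txt"):
        match = re.search(r"\d+", path.stem)
        if match:
            pages[int(match.group())] = path.read_text(
                encoding="utf-8", errors="ignore"
            )
    return pages


def find_pages(queries, pages):
    """Return {query: sorted list of page numbers whose text contains it}.

    Plain substring matching: a query such as "AB-12" will also match
    inside "AB-123", so tighten this if your part numbers overlap.
    """
    return {
        query: sorted(number for number, text in pages.items() if query in text)
        for query in queries
    }
```

With the part numbers loaded into a list (for example, one per line from a text file exported from Excel), `find_pages(queries, load_pages("pages"))` gives a dictionary you can write back out as CSV. Part numbers that the text converter splits across a line break will be missed, so it is worth spot-checking any query that comes back with no pages.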