Tags: python, split, bookmarks, pdftk

Trying to split a large .pdf into multiple files. (python, pdftk)


I've written a script in Python that will split a .pdf by chapter/bookmark. Here is essentially the crux of my program:

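import os
for start, end in chapters:  # chapters: (start, end) page pairs, one per bookmark
    os.system('pdftk A=file.pdf cat A{start}-{end} output chapter_{start}.pdf'.format(start=start, end=end))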

The toolkit works nicely, but invoking it over and over is not time-efficient. Parsing a 200 MB .pdf file takes a solid 15-20 seconds, and repeating that for some 30 individual chapters adds up to roughly 8-10 minutes. More time is spent opening the file than actually writing any data.

Since there doesn't seem to be an inherent way to string multiple commands within the toolkit, is there any memory trickery I can pull in Python or the CMD that will let me get around this (i.e. keep the .pdf open)? I'll look at another module, too, if you can suggest one (pyPdf has its own slew of problems though).


Solution

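  • To keep the PDF in memory, read the file once and pipe its bytes to pdftk on stdin (pdftk accepts - as the input filename to mean stdin). Use subprocess instead of os.system. Note that a StringIO buffer cannot be passed as the stdin argument, because subprocess needs an object with a real file descriptor; hand the bytes over through the input argument instead:

    import subprocess
    pdf_bytes = open('file.pdf', 'rb').read()  # read once, in binary mode
    # '...' stands for the rest of the pdftk arguments, with - as the input file
    subprocess.run('pdftk ...', input=pdf_bytes, shell=True)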
    

    It will still need to parse the PDF anew on each call, but at least you won't be reading it from disk more often than you have to. The only really fast way is to use a tool that can do the whole split in one pass (e.g., solve whatever problems you have with pyPdf).
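For illustration, here is a rough sketch of what a one-pass split could look like. It assumes the current `pypdf` package (the maintained successor to pyPdf) rather than the asker's original setup, and it may or may not avoid the problems the asker ran into. The helper names `chapter_ranges` and `split_in_one_pass`, the `starts` list of 1-based bookmark pages, and the output filename pattern are all hypothetical. The point is only that the file is parsed once and every chapter is written from the same reader object.

```python
def chapter_ranges(starts, total_pages):
    """Turn sorted 1-based chapter start pages into inclusive (start, end) pairs."""
    # Each chapter ends one page before the next one starts;
    # the last chapter runs to the end of the document.
    ends = [s - 1 for s in starts[1:]] + [total_pages]
    return list(zip(starts, ends))


def split_in_one_pass(pdf_path, starts, out_pattern="chapter_{:02d}.pdf"):
    """Parse pdf_path once and write one output file per chapter."""
    # Imported here so chapter_ranges() stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)  # the expensive parse happens only here
    ranges = chapter_ranges(starts, len(reader.pages))
    for number, (start, end) in enumerate(ranges, 1):
        writer = PdfWriter()
        # reader.pages is 0-based, while the ranges are 1-based and inclusive.
        for index in range(start - 1, end):
            writer.add_page(reader.pages[index])
        with open(out_pattern.format(number), "wb") as out:
            writer.write(out)
```

A call such as `split_in_one_pass('file.pdf', [1, 15, 40])` would then write three files, one per chapter, without invoking an external process per chapter.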