Split a PDF file into multiple files of 2 pages each in R


I have a PDF document with 300 pages. I need to split it into 150 files of 2 pages each. For example, the 1st file would contain pages 1 & 2 of the original, the 2nd file pages 3 & 4, and so on.

Maybe I can use the "pdftools" package, but I don't know how.


Solution

1) pdftools Assuming that the input PDF is in the current directory and that the outputs should go into the same directory, change the inputs below. The code gets the number of pages, num, computes the vectors st and en of start and end page numbers, and then calls pdf_subset once for each st/en pair. Note that pdf_length and pdf_subset come from the qpdf R package; pdftools imports and re-exports them, so loading pdftools is enough.

    library(pdftools)
    
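    # inputs
    infile <- "a.pdf"  # input PDF
    prefix <- "out_"   # output PDFs will begin with this prefix
    
    num <- pdf_length(infile)  # number of pages in the input
    st <- seq(1, num, 2)       # first page of each output file: 1, 3, 5, ...
    en <- pmin(st + 1, num)    # last page of each output file, capped at num
    
    for (i in seq_along(st)) {
      # zero-pad the file index to the number of digits in num,
      # e.g. out_001.pdf for a 300-page input
      outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
      pdf_subset(infile, pages = st[i]:en[i], output = outfile)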
    }
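
The start/end arithmetic above can be tried on its own, without any PDF file, to see what happens when the page count is odd. This is a minimal sketch in base R; the helper name page_ranges and its size and prefix arguments are made up for illustration and are not part of pdftools or qpdf.

```r
# Hypothetical helper (not part of pdftools/qpdf): compute the page ranges
# and output file names for splitting `num` pages into chunks of `size` pages.
page_ranges <- function(num, size = 2, prefix = "out_") {
  st <- seq(1, num, by = size)    # first page of each chunk
  en <- pmin(st + size - 1, num)  # last page of each chunk, capped at num
  data.frame(
    start = st,
    end = en,
    # zero-pad the file index to the number of digits in num
    file = sprintf("%s%0*d.pdf", prefix, nchar(num), seq_along(st)),
    stringsAsFactors = FALSE
  )
}

r <- page_ranges(7)  # a 7-page input, chosen to show the odd-count case
print(r)
```

For a 7-page input this gives the ranges 1-2, 3-4, 5-6 and 7-7, so the last file holds the single leftover page. Each row could then be passed to pdf_subset in the same way as in the loop above.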
    

2) pdfbox The Apache PDFBox command line utility can split a PDF into files of 2 pages each. Download the command line utilities .jar file from the PDFBox site and make sure Java is installed. Then run the following, assuming that the input file is a.pdf and is in the current directory (or run the quoted part directly from the command line, without the quotes and without R). The jar file name below may need to be changed if a later version is used; the one named below was the latest release (not counting alpha versions) when this was written.

    system("java -jar pdfbox-app-2.0.26.jar PDFSplit -split 2 a.pdf")
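
If it helps to check the prerequisites first, the same command can be run through system2, which takes the arguments as a character vector and returns the exit status. This is only a sketch under the same assumptions as above (the jar and a.pdf in the current directory, java on the PATH); the variable names are illustrative.

```r
# Sketch only: assumes the jar and the input PDF are in the current directory.
jar <- "pdfbox-app-2.0.26.jar"  # same jar name as above; adjust as needed
infile <- "a.pdf"               # input PDF

# Same arguments as in the system() call, passed as a character vector.
args <- c("-jar", shQuote(jar), "PDFSplit", "-split", "2", shQuote(infile))

if (nzchar(Sys.which("java")) && file.exists(jar) && file.exists(infile)) {
  status <- system2("java", args)  # exit status of java, 0 on success
  if (status != 0) warning("java exited with status ", status)
} else {
  message("java, ", jar, " or ", infile, " not found; nothing was run")
}
```

When java or either file is missing, the sketch prints a message and runs nothing instead of failing with a less obvious shell error.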
    

3) animation/pdftk Another option is to install the pdftk program, change the inputs at the top of the script below and run it. The script calls pdftk through the pdftk wrapper function in the animation package. It first gets the number of pages in the input, num, then computes the start and end page numbers, st and en, and finally invokes pdftk once for each st/en pair to extract those pages into a separate file.

    library(animation)
    
    # inputs
    PDFTK <- "~/../bin/pdftk.exe"  # path to pdftk
    infile <- "a.pdf"  # input PDF
    prefix <- "out_"   # output PDFs will begin with this prefix
    
    # tell the animation package where the pdftk executable is
    ani.options(pdftk = Sys.glob(PDFTK))
    
    # dump the PDF metadata to a temporary file and read the page count from it
    tmp <- tempfile()
    dump_data <- pdftk(infile, "dump_data", tmp)
    g <- grep("NumberOfPages", readLines(tmp), value = TRUE)
    num <- as.numeric(sub(".* ", "", g))
    
    st <- seq(1, num, 2)     # first page of each output file
    en <- pmin(st + 1, num)  # last page of each output file, capped at num
    
    for (i in seq_along(st)) {
      outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
      # extract pages st[i] to en[i] into outfile
      pdftk(infile, sprintf("cat %d-%d", st[i], en[i]), outfile)
    }