
How to find byte sizes of R figures on pages?


I would like to monitor the basic quality of figures produced in R page by page, for example the byte size of each page. At the moment I can only do quality assurance on the average page; see the section about that below. I think there must be something built in for this task that goes beyond average measures.

The code below produces 4 pages in Rplots.pdf, and I would like to know the byte size of each page in that output. Any other per-page statistics are also welcome. Basic memory monitoring by object is possible, but I would like the numbers to correspond to the output in the PDF.

# https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/plot.html
require(stats) # for lowess, rpois, rnorm
plot(cars)
lines(lowess(cars))

plot(sin, -pi, 2*pi) # see ?plot.function

## Discrete Distribution Plot:
plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
     main = "rpois(100, lambda = 5)")

## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:
plot(x <- sort(rnorm(47)), type = "s", main = "plot(x, type = \"s\")")
points(x, cex = .5, col = "dark red")

## TODO summarise here the byte size of each figure (1-4)
# Output: Rplots.pdf with 4 pages; I want to know the size of each page in bytes

I currently do basic quality assurance on the command line but would like to move some of it into R, to catch bugs faster.

Expected output: byte size, for instance like the 5th column of ls -l
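
One way to get per-figure sizes without leaving R is to let the pdf device write one file per page and then call file.size on each file. This is only a minimal sketch, assuming separate per-page files are an acceptable stand-in for the pages of a single Rplots.pdf. The output directory and the file name pattern are illustrative choices, and each size includes that file's own PDF header and metadata.

```r
# Sketch: write each figure to its own PDF so file.size() reports per-page bytes.
# The output directory and the "page%02d.pdf" pattern are illustrative choices.
out_dir <- file.path(tempdir(), "page_sizes")
dir.create(out_dir, showWarnings = FALSE)
unlink(Sys.glob(file.path(out_dir, "page*.pdf")))  # clear files from earlier runs

# onefile = FALSE makes the pdf device start a new file for every page
pdf(file.path(out_dir, "page%02d.pdf"), onefile = FALSE)
plot(cars)
lines(lowess(cars))
plot(sin, -pi, 2 * pi)
plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10)
plot(x <- sort(rnorm(47)), type = "s")
points(x, cex = 0.5, col = "dark red")
invisible(dev.off())

# Collect the per-page files and their sizes in bytes
files <- sort(Sys.glob(file.path(out_dir, "page*.pdf")))
page_sizes <- data.frame(page = seq_along(files),
                         file = basename(files),
                         size_bytes = file.size(files))
print(page_sizes)
```

The size_bytes column plays the role of the ls -l size column, one row per figure.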

Getting the average byte size of an individual page in an output document

Limitations

  • Requires homogeneous pages. This method only works if all pages come from the same sample. Otherwise the result is misleading, because an average does not describe the individual pages.

Other possible weaknesses

  • PDF elements and metadata. The method considers the PDF file as a whole rather than the graphic objects themselves. This limits the usefulness of the absolute value, because the file size also includes headers and other metadata that are not part of the graphic objects.

Code

filename <- "main.pdf"
filesize <- file.size(filename)
# http://unix.stackexchange.com/q/331175/16920
pages <- Rpoppler::PDF_info(filename)$Pages 

# average page size (= filesize / pages)
pagesize <- filesize / pages

## values for the example file: filesize, pages, pagesize
num 7350960
int 62
num 118564

Input: any 62-page document
Output: average size of an individual page (118564 bytes)
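
The homogeneity limitation can be made concrete with the four per-page sizes quoted in the test output later in this post. The sketch below uses only those numbers and treats their sum as if it were the size of the whole file, which is an approximation because a combined PDF shares some headers between pages.

```r
# Per-page sizes in bytes, copied from the four-page test output in this post
sizes <- c(4123942, 4971, 4672, 5370)

# Same formula as filesize / pages, with sum(sizes) standing in for filesize
avg <- sum(sizes) / length(sizes)
med <- median(sizes)

avg          # 1034738.75
med          # 5170.5
sizes / avg  # ratio of each page to the average
```

The average is roughly 200 times larger than three of the four pages and about a quarter of the first one, so it describes none of them well.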

Testing and's answer

It produces output, but the input cannot easily be changed to the PDF file you want:

     files                             size_bytes 
[1,] "./test_page_size_pdf/page01.pdf" "4,123,942"
[2,] "./test_page_size_pdf/page02.pdf" "    4,971"
[3,] "./test_page_size_pdf/page03.pdf" "    4,672"
[4,] "./test_page_size_pdf/page04.pdf" "    5,370"

Input: any 64-page document
Expected output: 67 (= 64 + 3) pages analysed, not 4

R: 3.3.2
OS: Debian 8.5


Solution

  • Download and install the pdftk utility if it is not already on your system, then try one of the following alternatives from within R.

    1) This bursts the PDF into one file per page (pg_0001.pdf, pg_0002.pdf, ...) and returns a data frame with the page file sizes in bytes and other information.

    myfile <- "Rplots.pdf"
    system(paste("pdftk", myfile, "burst"))
    file.info(Sys.glob("pg_*.pdf"))
    

    It will also generate a file doc_data.txt with some miscellaneous information that may or may not be of interest.

    1a) This alternative does not generate any files. It simply returns the byte sizes of the pages as a numeric vector.

    myfile <- "Rplots.pdf"
    pages <- as.numeric(read.dcf(pipe(paste("pdftk", myfile, "dump_data")))[, "NumberOfPages"])
    cmds <- sprintf("pdftk %s cat %d output - | wc -c", myfile, seq_len(pages))
    unname(sapply(cmds, function(cmd) scan(pipe(cmd), quiet = TRUE)))
    

    The above should work if pdftk and wc are on your path. Note that on Windows wc is included in the Rtools distribution and is typically found at "C:\\Rtools\\bin\\wc" once Rtools is installed.

    2) This alternative is similar to (1) but uses the animation package:

    library(animation)
    
    ani.options(pdftk = "/path/to/pdftk")
    pdftk("Rplots.pdf", "burst", "pg_%04d.pdf", "")
    file.info(Sys.glob("pg_*.pdf"))
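
Whichever alternative produces the pg_*.pdf files, the file.info result can be reduced to a compact page/size table. The helper below is a hypothetical convenience function (page_size_table is not part of pdftk or the animation package). It assumes the burst files sit in the directory passed as dir and that their zero-padded names sort in page order.

```r
# Hypothetical helper: tabulate the byte sizes of burst page files.
# Assumes zero-padded names such as pg_0001.pdf, so sorting gives page order.
page_size_table <- function(dir = ".", pattern = "pg_*.pdf") {
  files <- sort(Sys.glob(file.path(dir, pattern)))
  data.frame(page = seq_along(files),
             file = basename(files),
             size_bytes = file.size(files),
             stringsAsFactors = FALSE)
}

# Usage, after running one of the burst commands above:
# page_size_table()
```

The result has one row per page, which makes it easy to compare pages or to spot an unexpectedly large one.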