amazon-web-services pdf thumbnails graphicsmagick

Thumbnail the first page of a pdf from a stream in GraphicsMagick


I know how to use GraphicsMagick to make a thumbnail of the first page of a pdf if I have a pdf file and am running gm locally. I can just do this:

gm(pdfFileName + "[0]")
  .background("white")
  .flatten()
  .resize(200, 200)
  .write("output.jpg", (err, res) => {
    if (err) console.log(err);
  });

If I have a file called doc.pdf then passing doc.pdf[0] to gm works beautifully.

But my problem is I am generating thumbnails on an AWS Lambda function, and the Lambda takes as input data streamed from a source S3 bucket. The relevant slice of my lambda looks like this:

// Download the image from S3, transform, and upload to a different S3 bucket.
async.waterfall([
  function download(next) {
    s3.getObject({
      Bucket: sourceBucket,
      Key: sourceKey
    },
    next);
  },

  function transform(response, next) {
    gm(response.Body).size(function(err, size) {       // <--- gm USED HERE
    .
    .
    .

Everything works, but for multipage PDFs, gm generates the thumbnail from the last page of the PDF. How do I get the [0] in there? I did not see a page selector in the gm documentation, as all their examples use filenames, not streams. I believe there should be an API for this, but I have not found one.

(Note: the [0] is really important, not only because the last page of a multipage PDF is sometimes blank, but also because I noticed, when running gm on the command line with large PDFs, that the call with [0] returns very quickly, while without the [0] the whole PDF is scanned. On AWS Lambda, it's important to finish quickly to save on resources and avoid timeouts!)


Solution

  • You can use the .selectFrame() method, which is equivalent to appending [0] to the file name.

    In your code:

    function transform(response, next) {
        gm(response.Body)
            .selectFrame(0)       // <--- select the first page
            .size(function(err, size) {
            .
            .
            .
    

    Don't be confused by the name of the function. It works not only with frames of GIFs but also with pages of PDFs.

    Check out the source of this function on GitHub.

    Credit to @BenFortune for his answer to a similar question about selecting the first frame of a GIF. I took it as inspiration and tested this solution with PDFs, and it works.

    Hope it helps.
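
For completeness, here is a minimal sketch of how the whole thumbnail step might look when the PDF arrives as a Buffer (or a readable stream, which gm also accepts as its first argument). The function name thumbnailFirstPage and the placeholder name "input.pdf" are my own and only illustrative. The gm module is passed in as a parameter, which is an assumption made so that an ImageMagick subclass or a test stub can be substituted. The white background, flatten, and 200x200 size are copied from the question's local example.

```javascript
// Sketch only: thumbnail the first page of an in-memory PDF.
// `gm` is whatever require("gm") returns (hypothetical wiring; the
// original code uses the module directly).
function thumbnailFirstPage(gm, pdfData, callback) {
  gm(pdfData, "input.pdf")      // second argument is just a filename hint
    .selectFrame(0)             // same effect as "input.pdf[0]" for a file
    .background("white")        // settings taken from the question
    .flatten()
    .resize(200, 200)
    .toBuffer("JPG", callback); // callback(err, jpegBuffer)
}

// Possible use inside the waterfall's transform step:
//   thumbnailFirstPage(require("gm"), response.Body, next);
```

Returning a Buffer through toBuffer() instead of calling write() avoids touching the Lambda file system, and the result can be handed straight to the upload step.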