Control the Pandoc word document output size / test image sizes


My client wants to convert Markdown text to Word, and we'll be using Pandoc. However, we want to guard against malicious submissions (e.g., a Markdown document with 1000 externally hosted images of 10 MB each) that could stress or break the server while it produces the output.

The options are to match the image patterns in the Markdown with a regex and test their sizes (or limit their number), or to disallow external images entirely, but I wonder if there's a way to abort Pandoc if the produced docx exceeds a certain size.

Or is there a simple way to get the images and test their size?


Solution

  • Pandoc normally fetches the images while writing the output file, but you can take control of that by using a Lua filter to fetch the images yourself. This lets you stop fetching as soon as the combined size of the images becomes too large.

    local total_size_images = 0
    local max_images_size = 100000  -- in bytes
    
    -- Process all images
    function Image (img)
      -- use pandoc's default method to fetch the image contents
      local mimetype, contents = pandoc.mediabag.fetch(img.src)
      -- check that contents isn't too large
      total_size_images = total_size_images + #contents
      if total_size_images > max_images_size then
        error('images too large!')
      end
      -- replace image path with the hash of the image's contents.
      local new_filename = pandoc.utils.sha1(contents)
      -- store image in pandoc's "mediabag", so it won't be fetched again.
      pandoc.mediabag.insert(new_filename, mimetype, contents)
      img.src = new_filename
      -- return the modified image
      return img
    end
    

    Please make sure to read the section "A note on security" in the pandoc manual before publishing the app.
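
The other options raised in the question (limiting the number of images, or disallowing external images entirely) can be handled in a Lua filter as well, without fetching anything. The following is only a minimal sketch: the limit `max_image_count`, the helper `is_external`, and the rule that any source starting with a URI scheme or `//` counts as external are assumptions made for illustration, not part of the original answer.

```lua
-- Hypothetical limit; tune it for your deployment.
local max_image_count = 20
local image_count = 0

-- Assumption: a source is "external" if it starts with a URI scheme
-- (http:, https:, data:, ...) or is protocol-relative (//host/...).
-- Note that this also rejects Windows-style paths such as C:\img.png.
local function is_external (src)
  return src:match('^%a[%w+.-]*:') ~= nil or src:match('^//') ~= nil
end

-- Process all images; nothing is fetched here.
function Image (img)
  image_count = image_count + 1
  if image_count > max_image_count then
    error('too many images!')
  end
  if is_external(img.src) then
    error('external images are not allowed: ' .. img.src)
  end
  -- local image within the limit: keep it unchanged
  return img
end
```

As with the filter above, the call to `error` aborts the conversion. A filter saved as, say, `limit-images.lua` (a hypothetical file name) is passed to pandoc with the `--lua-filter` option.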