marklogic

Extract number of pages from PDF files


We're trying to use the xdmp:document-filter function to extract metadata about PDF files; in particular, we want to know the number of pages in a PDF. It seems that MarkLogic currently cannot retrieve this information for PDFs (nor for Word documents), although it can return the number of slides for a PowerPoint file. Is there perhaps a hidden option?

https://docs.marklogic.com/guide/search-dev/binary-document-metadata#id_98155 https://docs.marklogic.com/xdmp:document-filter

At some stage we may want to also extract metadata from audio files (MP3), such as duration in seconds and stereo/mono. Is this something that may be possible?


Solution

  • Please note that there are two approaches within MarkLogic when it comes to extraction from files:

    1. xdmp:document-filter(), as you have already discovered.

    2. The externally bundled document conversion libraries, which provide the xdmp:xxx-convert() functions.
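
    For the first approach, it can help to see exactly which metadata the filter returns for a given file before concluding that a field is missing. The following is a minimal sketch, not a definitive recipe: the URI /docs/sample.pdf is hypothetical and stands for a binary PDF already loaded into the database, and the names of the metadata fields that come back depend on the file and the server version.

    ```xquery
    xquery version "1.0-ml";

    (: Hypothetical URI: replace with a binary document stored in your database. :)
    let $uri := "/docs/sample.pdf"

    (: xdmp:document-filter returns an XHTML node; metadata appears as meta elements. :)
    let $filtered := xdmp:document-filter(fn:doc($uri))

    (: List every name/content pair so you can check whether a page count is present. :)
    for $meta in $filtered//*:meta
    return fn:concat($meta/@name, " = ", $meta/@content)
    ```

    The same query can be pointed at a PowerPoint or Word file to compare which fields are reported for each format.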

    The second option uses a completely different engine, and one of its options is to generate an XHTML document per page. I would suggest that you explore the options of xdmp:pdf-convert().

    This can have the unexpected effect of creating multiple documents in the system, but it may still serve your purpose once you work through the various options. The first node returned is the manifest, so it may contain enough information to count pages if you extract per page. The trick will be to get the information you need without the overhead of extracting items that are not needed. If this helps, you can also explore the other convert functions in the same family, such as the one for Word.
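
    As a rough sketch of the manifest idea, the query below converts a PDF and inspects only the first returned node. Several things here are assumptions: the URI /docs/sample.pdf is hypothetical, the manifest is read with namespace wildcards rather than a specific namespace, and the query uses default conversion settings. The per-page option is deliberately left out because its exact name should be taken from the xdmp:pdf-convert documentation. The count of XHTML parts only approximates a page count once per-page output is enabled.

    ```xquery
    xquery version "1.0-ml";

    (: Hypothetical URI: replace with a binary PDF stored in your database. :)
    let $uri := "/docs/sample.pdf"

    (: The second argument is the file name used to derive the names of the generated parts. :)
    let $results := xdmp:pdf-convert(fn:doc($uri), "sample.pdf")

    (: The first node returned is the manifest listing every generated part. :)
    let $manifest := $results[1]
    let $parts := $manifest//*:part/fn:string(.)

    return (
      (: Assumption: with per-page output enabled, each page yields one .xhtml part. :)
      fn:count($parts[fn:ends-with(., ".xhtml")]),
      $parts
    )
    ```

    Because only $results[1] is inspected and nothing is inserted, this query does not itself store the converted parts, although the conversion work is still performed.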