
Can xdmp:pdf-convert perform optical character recognition (OCR)?


Looking at the options in the documentation for xdmp:pdf-convert, it seems that MarkLogic can perform OCR, but my tests have not been successful. The documentation for the ignore-text option reads:

Enable/disable extraction of text from images. Documents consisting of scanned pages can only have text extracted if this parameter is set to true; however, diagrams with embedded text labels may be less palatable. For page-by-page conversion, the problem with reflowing of text and graphical elements within a diagram giving poor results is not such a problem, and the value of false will probably be the better choice.

However, in my tests with PDFs containing scanned pages, no text is extracted. I even tried creating my own PDF with a screenshot of lorem ipsum text. MarkLogic correctly extracts the images into their own files, but the resulting XHTML contains only a reference to the image. Has anyone succeeded in using xdmp:pdf-convert to perform OCR, or did you have to use another tool for the OCR? Ultimately, we would like to make the scanned PDFs searchable and available for parsing and transformation.

Sample XHTML created from my simple PDF:

<body class="font-0">
    <span class="pageStart" id="pgs0001">
    </span>
    <p
        style="text-align: left; line-height: 15.6pt; text-indent: 0pt; margin-left: 0pt; margin-right: 0pt; padding-left: 58.91pt; padding-top: 0; padding-bottom: 0; padding-right: 0; z-index: 100;">
        <span class="textStyle0">
            <a name="t0" id="t0">
            </a>
            This is text I typed into the PDF
        </span>
    </p>
    <div
        style="width: 768.00pt; height: 336.00pt; clip: rect(0pt, 768.00pt, 336.00pt, 0pt); margin-left: 49.09pt; margin-top: 0; margin-bottom: 0; margin-right: 0; padding: 0 0 0 0; z-index: 00;">
        <img src="testOcrPdf_pdf_parts/0001_00.jpg" width="768.00pt" height="336.00pt" border="0"
            alt="testOcrPdf_pdf_parts/0001_00.jpg(1587x695)">
        </img>
    </div>
    <span class="pageEnd" id="pge0001">
    </span>
</body>
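
To check quickly what the conversion actually produced, the XHTML can be scanned for visible text and image references. The following is only a minimal sketch using Python's standard html.parser module; the class and function names (ConvertedPageScanner, scan_converted_xhtml) are hypothetical, and the sample string is a trimmed copy of the output above.

```python
from html.parser import HTMLParser


class ConvertedPageScanner(HTMLParser):
    """Collect visible text and image references from converted XHTML."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []   # non-empty text nodes, whitespace-trimmed
        self.image_paths = []   # values of <img src="..."> attributes

    def handle_starttag(self, tag, attrs):
        # Record the path of every image the converter extracted.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_paths.append(src)

    def handle_data(self, data):
        # Keep only text nodes that contain something besides whitespace.
        stripped = data.strip()
        if stripped:
            self.text_chunks.append(stripped)


def scan_converted_xhtml(xhtml):
    """Return (text_chunks, image_paths) found in the given XHTML string."""
    scanner = ConvertedPageScanner()
    scanner.feed(xhtml)
    scanner.close()
    return scanner.text_chunks, scanner.image_paths


# Trimmed copy of the sample output from the question.
sample = """<body class="font-0">
  <p><span class="textStyle0"><a name="t0" id="t0"></a>
    This is text I typed into the PDF</span></p>
  <div><img src="testOcrPdf_pdf_parts/0001_00.jpg" width="768.00pt"
      height="336.00pt" border="0"
      alt="testOcrPdf_pdf_parts/0001_00.jpg(1587x695)"></img></div>
</body>"""

text_chunks, image_paths = scan_converted_xhtml(sample)
print(text_chunks)
print(image_paths)
```

For this sample, the only text found is the line typed directly into the PDF; the screenshot appears solely as an image path, which matches the behaviour described above.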


Solution

  • MarkLogic does not provide OCR on PDF (or other) documents. You'll need to use something external.
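
As an illustration of what "something external" might look like, here is a rough sketch that runs the Tesseract command-line tool on one of the extracted images. Tesseract is an assumption here, not something the answer names; any OCR tool could take its place. The helper names (build_ocr_command, ocr_image) are hypothetical, and the sketch assumes the image has already been exported from the database to the local file system.

```python
import shutil
import subprocess


def build_ocr_command(image_path):
    """Build the Tesseract invocation for one image.

    Passing "stdout" as the output base makes Tesseract print the
    recognized text instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout"]


def ocr_image(image_path):
    """Return recognized text, or None if the tesseract binary is missing."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_ocr_command(image_path),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract reports an error
    )
    return result.stdout


# Show the command that would be run for the image from the question.
command = build_ocr_command("testOcrPdf_pdf_parts/0001_00.jpg")
print(command)
```

The recognized text could then be stored in MarkLogic alongside the converted XHTML so that the scanned content becomes searchable; how that is loaded depends on your setup.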