
OCR - Azure Document Intelligence to recreate document digitally


Where I work we have lots of scanned documents. We want to digitize them without losing the general format of each document. A document can contain many kinds of content: key-value pairs (as in forms), titles, paragraphs, text in images, etc. We don't really care about losing the images, but we do want the text to appear where it was in the original.

I have created a simple React app with TypeScript where you choose some files and then, using Azure Document Intelligence's analysis features, I get a response like

{
  "content": "the complete found text in the document",
  "pages": [{
    "lines": [{
      "content": "the text for this line",
      "polygon": [xx,xx,xx,xx,xx,xx,xx,xx] // coordinates of where the text should be
    }]
  }]
}
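
For reference, the `polygon` array is flat: its eight numbers are four corner points stored as x, y pairs, measured from the top-left of the page. A minimal sketch of pairing them up and deriving an axis-aligned bounding box follows; the names `Point`, `BoundingBox`, `toPoints`, and `boundingBox` are made up for illustration and are not part of the Azure SDK.

```typescript
// Hypothetical helper types; not part of the Azure SDK.
interface Point {
  x: number;
  y: number;
}

interface BoundingBox {
  xMin: number;
  yMin: number;
  xMax: number;
  yMax: number;
}

// Pair up a flat [x1, y1, x2, y2, ...] array into points.
function toPoints(polygon: number[]): Point[] {
  if (polygon.length % 2 !== 0) {
    throw new Error("polygon must contain an even number of values");
  }
  const points: Point[] = [];
  for (let i = 0; i < polygon.length; i += 2) {
    points.push({ x: polygon[i], y: polygon[i + 1] });
  }
  return points;
}

// Axis-aligned box around the polygon, still in the page's own unit.
function boundingBox(polygon: number[]): BoundingBox {
  const points = toPoints(polygon);
  const xs = points.map((p) => p.x);
  const ys = points.map((p) => p.y);
  return {
    xMin: Math.min(...xs),
    yMin: Math.min(...ys),
    xMax: Math.max(...xs),
    yMax: Math.max(...ys),
  };
}
```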

I was able to generate a txt file with the whole content, which was a good start. However, now I want to recreate the document with the text in the exact position every time and generate a PDF out of it. I thought that I could do it using the polygon and the pdf-lib library, but there are many issues with what I tried: it does not seem to understand the coordinates and just puts everything in the same place.

This is my method:

async function createPdf(data: AnalyzeResultOutput) {
  const pdfDoc = await PDFDocument.create();
  const timesRomanFont = await pdfDoc.embedFont(StandardFonts.TimesRoman);

  for (const page of data.pages) {
    const pdfPage = pdfDoc.addPage();

    for (const line of page.lines ?? []) {
      console.log(line);
      const { content, polygon } = line;
      const x = polygon![0];
      const y = polygon![1];

      const size = 12;
      pdfPage.drawText(content, {
        x,
        y,
        size,
        maxWidth: Math.abs(polygon![0] - polygon![2]),
        font: timesRomanFont,
      });
    }
  }

  const pdfBytes = await pdfDoc.save();
  return pdfBytes;
}

I don't need the document to be perfect, just to maintain the general flow of the page. Does anyone know what I could do?

Also, I am currently using Azure, but I have read the documentation for other cloud services and they all work on basically the same idea. If you know of another service I could be using, I would be fine with changing.


Solution

  • Since perfect alignment is not required, the placement logic can stay simple: drawing each line at the position of its bounding box is enough to keep the text readable and maintain the overall flow of the document.

    • Transform coordinates to match the PDF coordinate system. Draw text based on the bounding box provided by the polygon coordinates.

    Code:

    import { PDFDocument, StandardFonts } from 'pdf-lib';
    import { AnalyzeResultOutput } from './types'; // Ensure you have a type definition for AnalyzeResultOutput
    
    async function createPdf(data: AnalyzeResultOutput) {
      const pdfDoc = await PDFDocument.create();
      const timesRomanFont = await pdfDoc.embedFont(StandardFonts.TimesRoman);
    
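      for (const page of data.pages) {
        // Document Intelligence reports sizes in inches for PDF input and in
        // pixels for image input; pdf-lib expects points (72 per inch).
        const scale = page.unit === 'inch' ? 72 : 1;
        const width = page.width * scale;
        const height = page.height * scale;
        const pdfPage = pdfDoc.addPage([width, height]);
    
        for (const line of page.lines ?? []) {
          const { content, polygon } = line;
          if (!polygon || polygon.length < 8) continue; // no usable position
    
          // Calculate the bounding box dimensions, converted to points
          const xMin = Math.min(polygon[0], polygon[2], polygon[4], polygon[6]) * scale;
          const xMax = Math.max(polygon[0], polygon[2], polygon[4], polygon[6]) * scale;
          const yMin = Math.min(polygon[1], polygon[3], polygon[5], polygon[7]) * scale;
          const yMax = Math.max(polygon[1], polygon[3], polygon[5], polygon[7]) * scale;
    
          // Transform the coordinates for PDF (origin at the bottom-left)
          const x = xMin;
          const y = height - yMax;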
    
          const size = 12;
          const textWidth = xMax - xMin;
          const textHeight = yMax - yMin;
    
          pdfPage.drawText(content, {
            x,
            y,
            size,
            font: timesRomanFont,
            maxWidth: textWidth,
            lineHeight: textHeight,
          });
        }
      }
    
      const pdfBytes = await pdfDoc.save();
      return pdfBytes;
    }
    
    export default createPdf;
    
    • The bounding box is calculated using the minimum and maximum x and y values from the polygon coordinates. This defines the area in which the text should be placed.

    • The y-coordinate is transformed to match the PDF coordinate system: const y = height - yMax;. This moves the origin from the top-left to the bottom-left.

    • pdfPage.drawText(content, { x, y, size, font: timesRomanFont, maxWidth: textWidth, lineHeight: textHeight });. The text is drawn within the calculated bounding box dimensions.
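
A fixed `size` of 12 can be too large or too small for a given box, and when the text is wider than `maxWidth`, pdf-lib wraps it onto extra lines. One possible refinement, sketched here and not part of the code above, is to derive the font size from the box. The sketch assumes a text-measuring callback is available; with pdf-lib, `(text, size) => timesRomanFont.widthOfTextAtSize(text, size)` should fit that role. `fitFontSize` and `MeasureWidth` are hypothetical names.

```typescript
// Hypothetical signature: returns the rendered width of `text` at `size`.
type MeasureWidth = (text: string, size: number) => number;

// Pick a font size that keeps `text` inside a box of boxWidth x boxHeight.
// Assumes the rendered width grows linearly with the font size and that the
// box height is a reasonable first guess for the size (both are approximations).
function fitFontSize(
  text: string,
  boxWidth: number,
  boxHeight: number,
  measure: MeasureWidth,
): number {
  // Start from the box height: the line should not be taller than its box.
  const byHeight = boxHeight;
  const widthAtHeight = measure(text, byHeight);
  if (widthAtHeight === 0 || widthAtHeight <= boxWidth) {
    return byHeight;
  }
  // Too wide: shrink proportionally so the text spans the box width.
  return byHeight * (boxWidth / widthAtHeight);
}
```

The result could replace the constant in `const size = 12;`, for example `const size = fitFontSize(content, textWidth, textHeight, measure);`.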


    NOTE: please check that the PDF page size and orientation match the scanned document. Incorrect page size or orientation could cause the text to appear displaced.
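
Related to the note above: Document Intelligence reports `width`, `height`, and every `polygon` in the page's `unit`, which is `pixel` for image input and `inch` for PDF input, while pdf-lib measures in points (72 per inch) from the bottom-left corner. A sketch of the conversion follows. The `dpi` parameter (scan resolution for pixel input) and the names `pointsPerUnit` and `toPdfPoint` are assumptions for illustration, not SDK features.

```typescript
type LengthUnit = "pixel" | "inch";

const POINTS_PER_INCH = 72;

// Points per source unit. For pixel input the scan resolution has to be
// supplied by the caller; the default of 72 dpi maps one pixel to one point.
function pointsPerUnit(unit: LengthUnit, dpi: number = 72): number {
  return unit === "inch" ? POINTS_PER_INCH : POINTS_PER_INCH / dpi;
}

// Convert a top-left-origin coordinate in source units to PDF space
// (points, bottom-left origin). pageHeight is given in source units.
function toPdfPoint(
  x: number,
  y: number,
  pageHeight: number,
  unit: LengthUnit,
  dpi: number = 72,
): { x: number; y: number } {
  const scale = pointsPerUnit(unit, dpi);
  return { x: x * scale, y: (pageHeight - y) * scale };
}
```

For a result page `page`, the matching PDF page size would be `page.width * pointsPerUnit(page.unit)` by `page.height * pointsPerUnit(page.unit)`.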