Tags: c#, pdf, itext, ocr, tesseract

How to preserve images and styling in PDF when creating a searchable PDF?


I have a website where my clients can upload their files (mainly PDFs). I want to make each PDF searchable without changing its look and feel. To do this, I have tried building a .NET endpoint that I can POST the file to.

I have tried iTextSharp in conjunction with Tesseract, but neither is giving me what I am looking for. Here is the code that I have tried:

Using Tesseract to get the text from the PDF:

     using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
     using (var img = Pix.LoadFromFile(testImagePath))
     using (var page = engine.Process(img))
     {
        var text = page.GetText();
     }

Then using iTextSharp to generate the new PDF from the old one:

     // open the reader
     PdfReader reader = new PdfReader(oldFile);
     Rectangle size = reader.GetPageSizeWithRotation(1);
     Document document = new Document(size);

     // open the writer
     FileStream fs = new FileStream(newFile, FileMode.Create, FileAccess.Write);
     PdfWriter writer = PdfWriter.GetInstance(document, fs);
     document.Open();

     // the pdf content
     PdfContentByte cb = writer.DirectContent;

     // select the font properties
     BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
     cb.SetColorFill(BaseColor.DARK_GRAY);
     cb.SetFontAndSize(bf, 8);

     // write the text in the pdf content
     cb.BeginText();
     string text = "Some random blablablabla...";
     // put the alignment and coordinates here
     cb.ShowTextAligned(1, text, 520, 640, 0);
     cb.EndText();
     cb.BeginText();
     text = "Other random blabla...";
     // put the alignment and coordinates here
     cb.ShowTextAligned(2, text, 100, 200, 0);
     cb.EndText();

     // create the new page and add it to the pdf
     PdfImportedPage page = writer.GetImportedPage(reader, 1);
     cb.AddTemplate(page, 0, 0);

     // close the streams and voilà, the file should be changed :)
     document.Close();
     fs.Close();
     writer.Close();
     reader.Close();

However, I am having trouble generating the desired output. Is there a simpler way to achieve what I am looking for? Here is an example of a PDF I am trying to make searchable. I do not want to lose the images or the fonts and styling of the PDF; I just want it to become searchable:

https://www.fujitsu.com/global/Images/sv600_c_normal.pdf


Solution

  • If you're interested in leveraging a commercial product for this, the LEADTOOLS SDK has an OCR toolkit with image-over-text functionality. This feature sets an image of the original file as an overlay in the output PDF, both making the text searchable and maintaining the appearance of the original input file.

    I was able to convert your document to a searchable version that still looks like the original using this code:

         string folderPath = "filepath";
    
         string inputFilename = Path.Combine(folderPath, "sv600_c_normal.pdf");
         string outputFilename = Path.Combine(folderPath, "sv600_c_normal-output.pdf");
    
         IOcrEngine engine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD);
         engine.Startup(null, null, null, null);
    
         PdfDocumentOptions pdfOptions = engine.DocumentWriterInstance.GetOptions(DocumentFormat.Pdf) as PdfDocumentOptions;
         pdfOptions.ImageOverText = true;
         engine.DocumentWriterInstance.SetOptions(DocumentFormat.Pdf, pdfOptions);
    
         engine.AutoRecognizeManager.Run(inputFilename, outputFilename, DocumentFormat.Pdf, null, null);

    Here's the output of the sample file. It's searchable and resembles the original.

    Disclaimer: I work for the company that makes LEADTOOLS.
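
For readers who want to stay with the two libraries from the question, the image-over-text idea described above can be roughly approximated by stamping Tesseract's recognized words onto the original page as invisible text. The following is only a minimal sketch and not the LEADTOOLS approach. It assumes iTextSharp 5.x and the Tesseract .NET wrapper, handles page 1 only, and assumes that `pageImagePath` (a hypothetical parameter) points to an image of that page already rendered at `dpi`. The class name `SearchableOverlay` and its helper methods are illustrative, not part of either library.

```csharp
using System.IO;
using iTextSharp.text.pdf;
using Tesseract;

public static class SearchableOverlay
{
    // Converts a length in image pixels to PDF points (72 points per inch).
    public static float PixelsToPoints(float pixels, float dpi)
    {
        return pixels * 72f / dpi;
    }

    // Image rows count down from the top; PDF y coordinates count up from the bottom.
    public static float FlipY(float yFromTop, float pageHeight)
    {
        return pageHeight - yFromTop;
    }

    // Assumption: pageImagePath is page 1 of inputPdf, rendered beforehand at `dpi`.
    public static void AddInvisibleText(string inputPdf, string pageImagePath,
                                        string outputPdf, float dpi)
    {
        var reader = new PdfReader(inputPdf);
        using (var fs = new FileStream(outputPdf, FileMode.Create, FileAccess.Write))
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(pageImagePath))
        using (var ocrPage = engine.Process(img))
        using (var iter = ocrPage.GetIterator())
        {
            // The stamper copies the existing page as-is and lets us draw on top of it.
            var stamper = new PdfStamper(reader, fs);
            float pageHeight = reader.GetPageSizeWithRotation(1).Height;
            PdfContentByte cb = stamper.GetOverContent(1);
            BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252,
                                              BaseFont.NOT_EMBEDDED);

            iter.Begin();
            do
            {
                Rect box;
                string word = iter.GetText(PageIteratorLevel.Word);
                if (string.IsNullOrWhiteSpace(word) ||
                    !iter.TryGetBoundingBox(PageIteratorLevel.Word, out box))
                {
                    continue; // skip empty results; the loop condition still advances
                }
                word = word.Trim();

                // Map the word's pixel box to PDF coordinates.
                float x = PixelsToPoints(box.X1, dpi);
                float baseline = FlipY(PixelsToPoints(box.Y2, dpi), pageHeight);
                float fontSize = PixelsToPoints(box.Height, dpi);
                float width = PixelsToPoints(box.Width, dpi);
                if (fontSize <= 0f) continue;

                cb.BeginText();
                // Invisible text: selectable and searchable, but never painted.
                cb.SetTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE);
                cb.SetFontAndSize(bf, fontSize);
                // Stretch or squeeze the word so it spans roughly the recognized box.
                float natural = bf.GetWidthPoint(word, fontSize);
                if (natural > 0f) cb.SetHorizontalScaling(100f * width / natural);
                cb.SetTextMatrix(x, baseline);
                cb.ShowText(word);
                cb.EndText();
            }
            while (iter.Next(PageIteratorLevel.Word));

            stamper.Close();
        }
        reader.Close();
    }
}
```

The sketch uses `PdfStamper` rather than a new `Document` with an imported page, as in the question, so that the original page content is carried over unchanged and only the hidden text is added. Word positions are approximate, rotated pages and non-Latin text (which the CP1252 Helvetica font cannot encode) would need extra handling, and rendering the PDF page to an image is left out entirely.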