
How to copy OCR text data to a new PDF with iText 7 (C#)


I have a scanned document that contains a page image with a transparent layer of OCR text on top of it. Is there a way to copy that text exactly as it is (unchanged, so it stays transparent and in the same position) to another PDF that I created without the images? I searched Google and did not find a solution.

I know that I can extract the text from the pages into a string and add it to my new document as a new paragraph, but that destroys the transparency and the positions of the recognized letters.

What I really want to do is replace the image beneath the OCR text. My first idea was to remove all images from the PDF and add new ones. I then decided this was not a good idea and not easy to do, because there are many different image types. (In the end I did it this way anyway; see the update below.)

Adding samples:

My sample document is a scanned document on which I ran OCR.

Sample document

My real code sample is here:

    // Apitron renders the source pages to bitmaps; iText 7 builds the new PDF.
    string dest = "C:\\ImagePaged.pdf";
    PdfWriter writer = new PdfWriter(dest);

    // Create the destination PdfDocument (pdfDoc is declared outside this snippet)
    pdfDoc = new PdfDocument(writer);

    // Create the layout Document without margins so each image can fill its page
    iText.Layout.Document document2 = new iText.Layout.Document(pdfDoc);
    document2.SetMargins(0, 0, 0, 0);

    List<int> rotatedPages = new List<int>();
    using (FileStream fs = new FileStream(@"C:\source.pdf", FileMode.Open))
    using (Document document = new Document(fs)) // this object represents the source PDF document
    {
        // Process and save the pages one by one
        for (int i = 0; i < document.Pages.Count; i++)
        {
            Page currentPage = document.Pages[i];

            // Render at three times the original page's width and height, with default rendering settings
            using (Bitmap bitmap = currentPage.Render((int)currentPage.Width * 3, (int)currentPage.Height * 3, new RenderingSettings()))
            {
                // Turn landscape renderings upright and remember their page numbers
                if (bitmap.Width > bitmap.Height)
                {
                    rotatedPages.Add(i + 1);
                    bitmap.RotateFlip(RotateFlipType.Rotate90FlipNone);
                }

                // Hand the bitmap to iText through a temporary PNG file
                bitmap.Save($"C:\\ImagePage{i}.png", ImageFormat.Png);
                iText.IO.Image.ImageData imageData = iText.IO.Image.ImageDataFactory.Create($"C:\\ImagePage{i}.png");
                Image image = new Image(imageData);
                document2.Add(image);
                File.Delete($"C:\\ImagePage{i}.png");
            }
            GC.Collect();
        }
    }
    document2.Close();
    GC.Collect();
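
The code above writes each rendered page to a PNG file only to read it back and delete it. As a minimal sketch, assuming the bitmap is a System.Drawing.Bitmap as in the sample, the round trip can probably be avoided by encoding the PNG in memory. The class name BitmapConversion and the method name ToImageData are hypothetical helpers, not part of iText or Apitron.

```csharp
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using iText.IO.Image;

static class BitmapConversion
{
    // Hypothetical helper: encodes the bitmap as PNG in memory and wraps the
    // bytes for iText, so no temporary file has to be written and deleted.
    public static ImageData ToImageData(Bitmap bitmap)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            bitmap.Save(buffer, ImageFormat.Png);
            return ImageDataFactory.Create(buffer.ToArray());
        }
    }
}
```

Inside the rendering loop this would be used as `document2.Add(new iText.Layout.Element.Image(BitmapConversion.ToImageData(bitmap)));`. The fully qualified name avoids a clash with System.Drawing.Image.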

UPDATE: SOLUTION

Thanks to the code provided by mkl, I was able to build a solution.

  1. I used Apitron to render an image of every page of my watermarked PDF (see the code sample above).
  2. I used the code provided by mkl to remove all images from my original PDF document.
  3. I used iText to add the images created by Apitron to the PDF file produced in step 2.

I won't post my entire solution here, but these are the important things to keep in mind:

a. Set the margins of your destination document to zero:

    document2.SetMargins(0, 0, 0, 0);

b. Rotate your image before adding it.

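    int rotations = pdfDoc.GetPage(i + 1).GetRotation();

    if (rotations > 0)
    {
        if (rotations == 270)
        {
            bitmap.RotateFlip(RotateFlipType.Rotate270FlipXY); // same enum value as Rotate90FlipNone
        }
        else if (rotations == 90)
        {
            bitmap.RotateFlip(RotateFlipType.Rotate90FlipXY); // same enum value as Rotate270FlipNone
        }
        else if (rotations == 180)
        {
            bitmap.RotateFlip(RotateFlipType.Rotate180FlipXY); // same enum value as RotateNoneFlipNone, so the bitmap is left unchanged
        }
    }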
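
The FlipXY members of System.Drawing.RotateFlipType are aliases: flipping on both axes equals a 180 degree rotation, so each FlipXY value shares its numeric value with a plain rotation. As a sketch, the rotation handling shown here could be expressed as one lookup that returns the same values under their simpler names. The class name PageRotation and the method name ToRotateFlip are hypothetical.

```csharp
using System.Drawing;

static class PageRotation
{
    // Hypothetical helper: returns the transform used for each page rotation
    // value, written with the equivalent non-flipping enum names.
    public static RotateFlipType ToRotateFlip(int pageRotation)
    {
        switch (pageRotation)
        {
            case 90:
                return RotateFlipType.Rotate270FlipNone;  // same value as Rotate90FlipXY
            case 270:
                return RotateFlipType.Rotate90FlipNone;   // same value as Rotate270FlipXY
            default:
                // Covers 0 and 180: Rotate180FlipXY has the same value as
                // RotateNoneFlipNone, so the bitmap is left as rendered.
                return RotateFlipType.RotateNoneFlipNone;
        }
    }
}
```

With this helper the branch reduces to `bitmap.RotateFlip(PageRotation.ToRotateFlip(pdfDoc.GetPage(i + 1).GetRotation()));`. Whether a page rotated by 180 degrees should really stay untouched depends on how the renderer treats the page rotation, so that case is worth checking against a real document.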

c. Create the image with a fixed position that includes the page number:

    image = new Image(imageData).SetFixedPosition(i + 1, 0, 0).SetAutoScale(true); // the first argument, i + 1, is the page number
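
To show how points a and c fit together, here is a rough sketch of step 3, assuming the image-free PDF from step 2 has been saved to disk and that one replacement image file exists per page. The names PageImagePlacer, AddPageImages, strippedPdf, resultPdf and imagePaths are hypothetical; the iText calls are the ones already used in this post.

```csharp
using iText.IO.Image;
using iText.Kernel.Pdf;
using iText.Layout;
using iText.Layout.Element;

static class PageImagePlacer
{
    // Hypothetical helper: imagePaths[k] is the replacement image for page k + 1.
    public static void AddPageImages(string strippedPdf, string resultPdf, string[] imagePaths)
    {
        // Open the image-free PDF for modification and write the result to a new file
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(strippedPdf), new PdfWriter(resultPdf));
        Document layout = new Document(pdfDoc);
        layout.SetMargins(0, 0, 0, 0);

        for (int i = 0; i < imagePaths.Length && i < pdfDoc.GetNumberOfPages(); i++)
        {
            // Pin the image to the lower-left corner of page i + 1
            Image image = new Image(ImageDataFactory.Create(imagePaths[i]))
                .SetFixedPosition(i + 1, 0, 0)
                .SetAutoScale(true);
            layout.Add(image);
        }

        // Closing the layout document also closes the underlying PdfDocument
        layout.Close();
    }
}
```

Content added this way is drawn after the existing page content, so the image ends up on top of the text layer rather than beneath it. That should not matter visually as long as the OCR text is invisible, which is the situation described in the question.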

Solution

  • In the comments I proposed replacing the image XObjects with a form XObject without any instructions. You tried to use a plain PdfDictionary for this, but that is too plain. Instead, try a method like the following, which uses the PDF object underlying an empty PdfFormXObject as the replacement:

    void replaceImages(PdfResources pdfResources, PdfObject replacement)
    {
        PdfDictionary xobjects = pdfResources.GetPdfObject().GetAsDictionary(PdfName.XObject);
        if (xobjects == null)
            return;
        ISet<PdfName> toReplace = new HashSet<PdfName>();
        foreach (KeyValuePair<PdfName, PdfObject> entry in xobjects.EntrySet())
        {
            PdfObject pdfObject = entry.Value;
            if (pdfObject is PdfIndirectReference reference)
                pdfObject = reference.GetRefersTo();
            if (pdfObject is PdfStream pdfStream && PdfName.Image.Equals(pdfStream.GetAsName(PdfName.Subtype)))
            {
                toReplace.Add(entry.Key);
            }
        }
        foreach (PdfName name in toReplace)
        {
            xobjects.Put(name, replacement);
        }
    }
    

    (ReplaceImageWithEmptyObject helper method)

    Beware, this changes the actual resources without updating the internal cache of the PdfResources instance. Thus, its state may become inconsistent and you shouldn't use that instance for other operations.

    You can apply it to a PdfDocument pdfDocument like this:

    PdfFormXObject replacement = new PdfFormXObject(new Rectangle(1, 1));
    for (int pageNr = 1; pageNr <= pdfDocument.GetNumberOfPages(); pageNr++)
    {
        PdfPage pdfPage = pdfDocument.GetPage(pageNr);
        PdfResources pdfResources = pdfPage.GetResources();
        replaceImages(pdfResources, replacement.GetPdfObject());
    }
    

    (ReplaceImageWithEmptyObject test testReplaceForVZ)

    It works as expected with your example PDF.