I have a scanned document. It contains a transparent layer of text (from OCR) with the page image beneath it. Is there a way to copy the text as it is (with no changes, so that it stays transparent and in the same position) to another .pdf that I created (without images)? I searched on Google and didn't find any solutions.
I know that I can extract the text from the pages to a string and then add it to my new document as a new paragraph. But that would destroy the transparency and the positions of the recognized letters.
What I really want to do is change the image beneath the OCR text. At first my idea was to remove all images from the .pdf and add new ones. Then I realized that this is not a good idea and not very easy to do, as there are a lot of different image types (but I did it this way in the end; see the solution below).
Adding samples:
My sample document is a scanned document on which I ran OCR.
My real code is here:
    String dest = "C:\\ImagePaged.pdf";
    PdfWriter writer = new PdfWriter(dest);
    // Create the destination PdfDocument
    pdfDoc = new PdfDocument(writer);
    // Create the layout Document that wraps it
    iText.Layout.Document document2 = new iText.Layout.Document(pdfDoc);
    document2.SetMargins(0, 0, 0, 0);

    List<int> rotatedPages = new List<int>();
    // In a verbatim string (@"...") backslashes are not escaped, so single ones are used
    using (FileStream fs = new FileStream(@"C:\source.pdf", FileMode.Open))
    using (Document document = new Document(fs)) // this object represents the source PDF document
    {
        // render the pages one by one and add each rendering to the new document
        for (int i = 0; i < document.Pages.Count; i++)
        {
            Page currentPage = document.Pages[i];
            // render at three times the original page's width and height, with default rendering settings
            using (Bitmap bitmap = currentPage.Render((int)currentPage.Width * 3, (int)currentPage.Height * 3, new RenderingSettings()))
            {
                // remember landscape renderings and turn them to portrait
                if (bitmap.Width > bitmap.Height)
                {
                    rotatedPages.Add(i + 1);
                    bitmap.RotateFlip(RotateFlipType.Rotate90FlipNone);
                }
                // go through a temporary PNG file to obtain iText image data
                bitmap.Save($"C:\\ImagePage{i}.png", ImageFormat.Png);
                iText.IO.Image.ImageData imageData = iText.IO.Image.ImageDataFactory.Create($"C:\\ImagePage{i}.png");
                Image image = new Image(imageData);
                imageData = null;
                document2.Add(image);
                image = null;
                File.Delete($"C:\\ImagePage{i}.png");
            }
            GC.Collect();
        }
    } // the using statements dispose document and fs here
    document2.Close();
    GC.Collect();
UPDATE: SOLUTION
Thanks to the code provided by mkl I was able to build a solution.
I won't post my entire solution here, but these are the important things to keep in mind:
a. Set the margins of your destination document like this:
    document2.SetMargins(0, 0, 0, 0);
b. Rotate your image before adding it:
    int rotations = pdfDoc.GetPage(i + 1).GetRotation();
    if (rotations > 0)
    {
        if (rotations == 270)
        {
            bitmap.RotateFlip(RotateFlipType.Rotate270FlipXY);
        }
        else if (rotations == 90)
        {
            bitmap.RotateFlip(RotateFlipType.Rotate90FlipXY);
        }
        else if (rotations == 180)
        {
            // note: a 180-degree rotation followed by a flip on both axes leaves the bitmap unchanged
            bitmap.RotateFlip(RotateFlipType.Rotate180FlipXY);
        }
    }
c. Create the image together with the number of the page it belongs on:
    image = new Image(imageData).SetFixedPosition(i + 1, 0, 0).SetAutoScale(true);
(the first argument, i + 1, is the page number)
In the comments I proposed replacing the image XObjects with a form XObject without any instructions. You tried to use a plain PdfDictionary for this, but that is too plain. Instead, try a method like this, using the PDF object underneath an empty PdfFormXObject as the replacement:
    void replaceImages(PdfResources pdfResources, PdfObject replacement)
    {
        PdfDictionary xobjects = pdfResources.GetPdfObject().GetAsDictionary(PdfName.XObject);
        if (xobjects == null)
            return;
        ISet<PdfName> toReplace = new HashSet<PdfName>();
        foreach (KeyValuePair<PdfName, PdfObject> entry in xobjects.EntrySet())
        {
            PdfObject pdfObject = entry.Value;
            if (pdfObject is PdfIndirectReference reference)
                pdfObject = reference.GetRefersTo();
            if (pdfObject is PdfStream pdfStream && PdfName.Image.Equals(pdfStream.GetAsName(PdfName.Subtype)))
            {
                toReplace.Add(entry.Key);
            }
        }
        foreach (PdfName name in toReplace)
        {
            xobjects.Put(name, replacement);
        }
    }
(ReplaceImageWithEmptyObject helper method)
Beware, this changes the actual resources without updating the internal cache of the PdfResources instance. Thus, its state may become inconsistent, and you shouldn't use that instance for other operations.
You can apply it to a PdfDocument pdfDocument like this:
    PdfFormXObject replacement = new PdfFormXObject(new Rectangle(1, 1));
    for (int pageNr = 1; pageNr <= pdfDocument.GetNumberOfPages(); pageNr++)
    {
        PdfPage pdfPage = pdfDocument.GetPage(pageNr);
        PdfResources pdfResources = pdfPage.GetResources();
        replaceImages(pdfResources, replacement.GetPdfObject());
    }
(ReplaceImageWithEmptyObject test testReplaceForVZ)
It works as expected with your example PDF.
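For completeness, here is a minimal sketch of how the pieces above might be wired into one self-contained class, assuming iText 7 for .NET. The class name ImageStripper, the method names StripImages and ReplaceAllImages, and the file paths in the usage line are hypothetical and only there for illustration. Like the method above, it only looks at the XObject resources of each page itself, so images nested inside form XObjects or patterns would not be touched.

```csharp
using System.Collections.Generic;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Xobject;

// Hypothetical wrapper class; the name is not part of the original answer.
public static class ImageStripper
{
    // Reads src, replaces the page-level image XObjects, writes the result to dest.
    public static void StripImages(string src, string dest)
    {
        // Passing both a reader and a writer opens the existing PDF for modification
        PdfDocument pdfDocument = new PdfDocument(new PdfReader(src), new PdfWriter(dest));
        ReplaceAllImages(pdfDocument);
        pdfDocument.Close(); // closing writes the modified document to dest
    }

    public static void ReplaceAllImages(PdfDocument pdfDocument)
    {
        // A single empty form XObject is shared by all replaced entries
        PdfFormXObject replacement = new PdfFormXObject(new Rectangle(1, 1));
        for (int pageNr = 1; pageNr <= pdfDocument.GetNumberOfPages(); pageNr++)
        {
            PdfResources pdfResources = pdfDocument.GetPage(pageNr).GetResources();
            ReplaceImages(pdfResources, replacement.GetPdfObject());
        }
    }

    // Same logic as replaceImages above: collect the names first, then overwrite them.
    static void ReplaceImages(PdfResources pdfResources, PdfObject replacement)
    {
        PdfDictionary xobjects = pdfResources.GetPdfObject().GetAsDictionary(PdfName.XObject);
        if (xobjects == null)
            return;
        ISet<PdfName> toReplace = new HashSet<PdfName>();
        foreach (KeyValuePair<PdfName, PdfObject> entry in xobjects.EntrySet())
        {
            PdfObject pdfObject = entry.Value;
            if (pdfObject is PdfIndirectReference reference)
                pdfObject = reference.GetRefersTo();
            if (pdfObject is PdfStream pdfStream && PdfName.Image.Equals(pdfStream.GetAsName(PdfName.Subtype)))
                toReplace.Add(entry.Key);
        }
        foreach (PdfName name in toReplace)
            xobjects.Put(name, replacement);
    }
}
```

A call would then look like ImageStripper.StripImages(@"C:\source.pdf", @"C:\source-no-images.pdf"); with whatever paths apply in your case. Because of the cache caveat mentioned above, it is probably safest to add new images in a separate pass over the saved result rather than in the same PdfDocument session.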