
Highlight words in an existing PDF document using iText7 and C#


I've been floundering and going in circles with this problem for a couple of days now. I'm hoping someone here can help.

I have a PDF document in a FileStream. Using iText7 8.0.3 from C#, I'd like to find every instance of the keyword 'and', highlight each one with a red background, and then save the document to a MemoryStream and from there to a copy of the PDF.

Here's my code, which almost works: it does render the red backgrounds, just in the wrong relative locations:

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using iText.Kernel.Colors;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

FileStream src = new FileStream("C:\\Temp\\34207180.pdf", FileMode.Open);
MemoryStream ms = new MemoryStream();

string keyword = "and";

PdfDocument pdfDoc = new PdfDocument(new PdfReader(src), new PdfWriter(ms));

int pdfPages = pdfDoc.GetNumberOfPages();

for (int page = 1; page <= pdfPages; page++)
{

    Regex regex = new Regex(keyword, RegexOptions.IgnoreCase);

    RegexBasedLocationExtractionStrategy extractionStrategy = new RegexBasedLocationExtractionStrategy(regex);

    PdfCanvasProcessor parser = new PdfCanvasProcessor(extractionStrategy);

    parser.ProcessPageContent(pdfDoc.GetPage(page));

    List<IPdfTextLocation> locs = extractionStrategy.GetResultantLocations().ToList();

    PdfCanvas pdfCanvas = new PdfCanvas(pdfDoc.GetPage(page).NewContentStreamAfter(), pdfDoc.GetPage(page).GetResources(), pdfDoc);

    foreach (var l in locs)
    {

        pdfCanvas
            .SaveState()
            .SetFillColor(ColorConstants.RED)
            .Rectangle(l.GetRectangle().GetX(), l.GetRectangle().GetY(), l.GetRectangle().GetWidth(), l.GetRectangle().GetHeight())
            .Fill()
            .RestoreState();

    }

}

pdfDoc.Close();

byte[] img = ms.ToArray();
File.WriteAllBytes("C:\\Temp\\34207180-dest.pdf", img);

Here are the example input and output PDF files: Source, Destination.

Can anyone explain what's going on? It looks as though GetResultantLocations returns values on a different scale from the one PdfCanvas expects for Rectangle and Fill.

I have read many articles on this site and elsewhere without finding a resolution.


Solution

  • The drawing you do is affected by a transformation matrix set in the original content.

    To be unaffected by any active transformation matrix, you could use the PdfCanvas constructor PdfCanvas(PdfPage page, bool wrapOldContent). With wrapOldContent set to true, the existing content is wrapped in a save state / restore state pair, so your new drawing starts from a pristine graphics state.

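    The drawn rectangles will block out the text. That can be fixed by setting the blend mode to multiply: pdfCanvas.SetExtGState(new PdfExtGState().SetBlendMode(PdfExtGState.BM_MULTIPLY)). PdfExtGState lives in the iText.Kernel.Pdf.Extgstate namespace, so you will need a using directive for it.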

    I have updated part of your code to reflect these and some other changes:

        // Wrap the existing content in save/restore state so no leftover transformation matrix applies.
        PdfCanvas pdfCanvas = new PdfCanvas(pdfDoc.GetPage(page), true);
        pdfCanvas.SaveState();
        pdfCanvas.SetFillColor(ColorConstants.RED);
        // Multiply blending keeps the text visible through the red fill.
        pdfCanvas.SetExtGState(new PdfExtGState().SetBlendMode(PdfExtGState.BM_MULTIPLY));

        foreach (var l in locs)
        {
            var rect = l.GetRectangle();
            pdfCanvas
                .Rectangle(rect.GetX(), rect.GetY(), rect.GetWidth(), rect.GetHeight())
                .Fill();
        }
        pdfCanvas.RestoreState();
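
To see why a leftover transformation matrix moves the rectangles, here is a minimal sketch in plain C# with no iText dependency. The matrix values and the corner coordinates are hypothetical, chosen only for illustration; the source PDF's actual matrix is not known. It also assumes the extraction strategy reports locations in the page's default coordinate system, which is what the fix above suggests. A PDF matrix [a b c d e f] maps a point (x, y) to (a*x + c*y + e, b*x + d*y + f), and System.Numerics.Matrix3x2 uses the same layout.

```csharp
using System;
using System.Numerics;

// Hypothetical matrix left active by the page's original content,
// equivalent to the PDF operator "0.5 0 0 0.5 100 200 cm".
var leftoverCtm = new Matrix3x2(0.5f, 0f, 0f, 0.5f, 100f, 200f);

// Hypothetical lower-left corner of a match rectangle, in default page coordinates.
var reported = new Vector2(72f, 700f);

// Drawing while the leftover matrix is still active lands the corner at
// (72*0.5 + 100, 700*0.5 + 200) = (136, 550) instead of (72, 700).
Vector2 drawn = Vector2.Transform(reported, leftoverCtm);

// After the old content is wrapped in save/restore state, the matrix is the
// identity again, so the corner is drawn where it was reported.
Vector2 drawnAfterWrap = Vector2.Transform(reported, Matrix3x2.Identity);

Console.WriteLine($"reported:     {reported.X}, {reported.Y}");
Console.WriteLine($"leftover CTM: {drawn.X}, {drawn.Y}");
Console.WriteLine($"identity CTM: {drawnAfterWrap.X}, {drawnAfterWrap.Y}");
```

This would explain highlights that look scaled and shifted relative to the text: the coordinates themselves are fine, but they get transformed a second time when they are drawn.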