I'm trying to get the location of a specific string in PDFs generated from different sources (Word, Excel, CorelDraw, ...). I have extracted the part of the code that I think is relevant to this question:
    private static iText.Kernel.Geom.Rectangle? textRect;
    private static readonly string specificText = "Test";

    private void BtnNewPDF_Click(object sender, RoutedEventArgs e)
    {
        // Initialize PDF document
        PdfDocument pdfDoc = new(new PdfReader(@"D:\Word-test.pdf"));
        using Document document = new(pdfDoc);
        // Get the specified page from the PDF document
        PdfPage pdfPage = pdfDoc.GetPage(1);
        // Create a PdfCanvasProcessor object
        PdfCanvasProcessor canvasProcessor = new(new MyEventListener());
        // Process the page
        canvasProcessor.ProcessPageContent(pdfPage);
        if (textRect is not null)
        {
            MessageBox.Show(textRect.GetX().ToString());
            MessageBox.Show(textRect.GetY().ToString());
        }
        document.Close();
    }

    class MyEventListener : IEventListener
    {
        public void EventOccurred(IEventData data, EventType type)
        {
            if (type == EventType.RENDER_TEXT)
            {
                // Cast the IEventData to TextRenderInfo
                TextRenderInfo renderInfo = (TextRenderInfo)data;
                // See if the current chunk contains the text
                var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), specificText);
                // If not found, bail
                if (startPosition < 0) { return; }
                // Get the bounding box for the text
                textRect = renderInfo.GetDescentLine().GetBoundingRectangle();
            }
        }

        public ICollection<EventType> GetSupportedEvents()
        {
            return new List<EventType> { EventType.RENDER_TEXT };
        }
    }
When I'm working with a PDF generated from CorelDraw, I'm able to get the whole string ("Test") in the event data, but when I'm using a PDF generated from Microsoft Word, I'm getting chunks like "T" and "est".
Am I using the wrong procedure to get the location of the string, or can I change some setting in Microsoft Word when generating the PDF? If neither of these is possible, I'll need to make some code to concatenate the letters to get the searched string and its location. I don't know how to attack this problem. Can somebody tell me, in general, how this can be solved?
PDF content is drawn according to the drawing instructions in several kinds of content streams therein. These instructions may draw text in different ways: they may draw a whole text line in one instruction, they may use a separate instruction for each word, or they may even use a separate instruction for each letter.
iText forwards you each string argument of a text drawing instruction in a separate event. Thus, your observation essentially means that CorelDraw draws larger line pieces at once than Word does. Most likely Word draws "T" and "est" in separate instructions to apply kerning in-between.
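The effect can be reproduced without iText. The following is only a minimal sketch: `ChunkSearch` and `FindChunkRange` are hypothetical names introduced for illustration, and the chunks are assumed to arrive already in reading order, which real content streams do not guarantee. It concatenates the chunks, searches the combined text, and maps the match back to the chunks it spans.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of iText.
public static class ChunkSearch
{
    // Returns the indices of the first and last chunk covered by the first
    // occurrence of searchText in the concatenated chunks, or null if absent.
    public static (int First, int Last)? FindChunkRange(
        IReadOnlyList<string> chunks, string searchText)
    {
        if (string.IsNullOrEmpty(searchText)) return null;

        // Concatenate the chunks, remembering where each one starts.
        var text = new StringBuilder();
        var chunkStarts = new List<int>();
        foreach (string chunk in chunks)
        {
            chunkStarts.Add(text.Length);
            text.Append(chunk);
        }

        int matchStart = text.ToString().IndexOf(searchText, StringComparison.Ordinal);
        if (matchStart < 0) return null;
        int matchEnd = matchStart + searchText.Length - 1;

        // Map the character range of the match back to chunk indices.
        int first = -1, last = -1;
        for (int i = 0; i < chunks.Count; i++)
        {
            if (chunks[i].Length == 0) continue;
            int start = chunkStarts[i];
            int end = start + chunks[i].Length - 1;
            if (first < 0 && matchStart <= end) first = i;
            if (matchEnd >= start) last = i;
        }
        return (first, last);
    }
}
```

With chunks such as "A ", "T", "est", " line", no single chunk contains "Test", but the combined text does, and the match spans the second and third chunk. In a real listener, the bounding rectangles of those chunks would then have to be merged.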
So essentially you indeed "need to make some code to concatenate the letters to get searched string and get location."

In that context you wonder "how to attack this problem. Can somebody tell me, in general, how can this be solved?"
iText includes example event listeners for text extraction, e.g. the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy. As you are trying to extract not only the text but also its location, the newer RegexBasedLocationExtractionStrategy might be of special interest to you.

If you cannot use the RegexBasedLocationExtractionStrategy as is, you can at least look at its code for inspiration. iText is open source after all...
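For illustration, here is a minimal sketch of how the RegexBasedLocationExtractionStrategy might be applied to the search from the question, assuming iText 7 or later for .NET. `TextLocator` and `FindText` are hypothetical names, not part of iText, and the exact shape of the returned rectangles should be checked against the iText version in use.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical helper, not part of iText.
public static class TextLocator
{
    // Returns one rectangle per occurrence of searchText on the given page.
    public static List<Rectangle> FindText(string pdfPath, int pageNumber, string searchText)
    {
        var rectangles = new List<Rectangle>();
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(pdfPath));
        try
        {
            // Escape the search text so it is matched literally, not as a regex.
            var strategy = new RegexBasedLocationExtractionStrategy(Regex.Escape(searchText));

            // The strategy collects all text chunks of the page; matching
            // happens on the combined text when the locations are requested.
            new PdfCanvasProcessor(strategy).ProcessPageContent(pdfDoc.GetPage(pageNumber));

            foreach (IPdfTextLocation location in strategy.GetResultantLocations())
            {
                rectangles.Add(location.GetRectangle());
            }
        }
        finally
        {
            pdfDoc.Close();
        }
        return rectangles;
    }
}
```

A call such as `TextLocator.FindText(@"D:\Word-test.pdf", 1, "Test")` would then replace the custom listener from the question. The coordinates of each rectangle are available via `GetX()` and `GetY()`, as in the original code.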