I'm currently validating the correct order of the content in a Tagged PDF File.
Is there any way to extract the reading order numbers of Tagged PDF Files programmatically?
I've tried converting the tagged PDF to XML, but I can't figure out which tags belong to which piece of text.
I've also tried several libraries, but I can't find any method that returns the reading order numbers.
Is it really possible? Thanks in advance!
You can extract the marked content tree of a tagged PDF using the PdfPig (.NET) library. My understanding is that the reading order is indicated by the marked-content identifier (MCID).
If a marked content element does not contain an MCID (like pagination elements), the MCID is set to -1.
Each MarkedContentElement will contain the letters, images and paths that belong to it:
using System.Linq; // needed for OrderBy
using UglyToad.PdfPig;
[...]
using (PdfDocument document = PdfDocument.Open(pathToFile))
{
    // PdfPig page numbers are 1-based
    for (int p = 0; p < document.NumberOfPages; p++)
    {
        var page = document.GetPage(p + 1);
        // extract the page's marked content
        var markedContents = page.GetMarkedContents();
        // sort by MCID; elements without an MCID (-1) come first
        var orderedMarkedContents = markedContents
            .OrderBy(mc => mc.MarkedContentIdentifier);
        foreach (var mc in orderedMarkedContents)
        {
            // do something
        }
    }
}
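As a rough sketch of what the "do something" step could look like, the snippet below prints each element's MCID, tag and text so that the order can be checked by eye. The class name MarkedContentDump and the helper Describe are hypothetical names made up for this illustration. It also assumes that concatenating the Value of each letter gives a usable text for the element, which may not hold for every document.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;

public static class MarkedContentDump
{
    // Hypothetical helper (not part of PdfPig): formats one output line.
    public static string Describe(int pageNumber, int mcid, string tag, string text)
        => $"page {pageNumber} | MCID {mcid} | {tag} | {text}";

    public static void Dump(string pathToFile)
    {
        using (PdfDocument document = PdfDocument.Open(pathToFile))
        {
            for (int p = 0; p < document.NumberOfPages; p++)
            {
                var page = document.GetPage(p + 1);

                // Sort the top-level marked content elements by MCID;
                // elements without an MCID (-1) are listed first.
                var ordered = page.GetMarkedContents()
                    .OrderBy(mc => mc.MarkedContentIdentifier);

                foreach (var mc in ordered)
                {
                    // Assumption: joining the letters' values is good enough
                    // to recognise which text the element covers.
                    string text = string.Concat(mc.Letters.Select(l => l.Value));
                    Console.WriteLine(
                        Describe(page.Number, mc.MarkedContentIdentifier, mc.Tag, text));
                }
            }
        }
    }
}
```

Calling MarkedContentDump.Dump(pathToFile) should then write one line per top-level marked content element. Nested elements (available through the Children property) are not visited in this sketch.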
If you want to export the result to XML, you can have a look at the PageXmlTextExporter class. Have a look at the wiki for more information on ITextExporter and IReadingOrderDetector.
Note: I am an active contributor to this library.