Extract Reading Order Sequence in a Tagged PDF

I'm currently validating the correct order of the content in a Tagged PDF File.

Is there any way to extract the reading order numbers of Tagged PDF Files programmatically?

I've tried converting the tagged PDF to XML, but I can't figure out which tags belong to which piece of text.

I've tried the following libraries:

Syncfusion
iText 7

but I couldn't find any method that returns the reading order numbers.

Is it really possible? Thanks in advance!

Solution

You can extract the marked content tree of a tagged PDF using the PdfPig (.NET) library. My understanding is that the reading order is indicated by the marked-content identifier (MCID).

If a marked content element does not have an MCID (pagination elements, for example), PdfPig sets its MCID to -1.

Each MarkedContentElement will contain the letters, images and paths that belong to it:

        using System.Linq; // needed for OrderBy
        using UglyToad.PdfPig;
        [...]

        using (PdfDocument document = PdfDocument.Open(pathToFile))
        {
            // PdfPig page numbers are 1-based
            for (int p = 1; p <= document.NumberOfPages; p++)
            {
                var page = document.GetPage(p);

                // extract the page's marked content
                var markedContents = page.GetMarkedContents();

                // sort by MCID; elements without an MCID (-1) come first
                var orderedMarkedContents = markedContents
                       .OrderBy(mc => mc.MarkedContentIdentifier);

                foreach (var mc in orderedMarkedContents)
                {
                    // do something with mc.Letters, mc.Images, mc.Paths
                }
            }
        }
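
To see which text sits behind each MCID, a minimal sketch along these lines might help. The class and method names (MarkedContentDump, Dump) are made up for illustration, and it assumes that elements with an MCID of -1 can be skipped and that joining the Value of each letter is good enough as a text preview. Nested elements under mc.Children are not visited here.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Hypothetical helper, not part of PdfPig.
public static class MarkedContentDump
{
    // Prints one line per top-level marked content element that has an MCID:
    // page number, MCID, tag and the concatenated letters.
    public static void Dump(string pathToFile)
    {
        using (PdfDocument document = PdfDocument.Open(pathToFile))
        {
            foreach (Page page in document.GetPages())
            {
                // Assumption: elements without an MCID (-1) are not of interest
                var withMcid = page.GetMarkedContents()
                    .Where(mc => mc.MarkedContentIdentifier >= 0)
                    .OrderBy(mc => mc.MarkedContentIdentifier);

                foreach (MarkedContentElement mc in withMcid)
                {
                    // Letters carry no spacing information, so words run together
                    string text = string.Concat(mc.Letters.Select(l => l.Value));
                    Console.WriteLine(
                        $"{page.Number}\t{mc.MarkedContentIdentifier}\t{mc.Tag}\t{text}");
                }
            }
        }
    }
}
```

Calling MarkedContentDump.Dump(pathToFile) and comparing the output with the order shown in your PDF editor should tell you quickly whether the MCID order matches the reading order for your files.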

If you want to export the result to XML, have a look at the PageXmlTextExporter class. The wiki has more information on ITextExporter and IReadingOrderDetector.
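
A rough sketch of the XML export could look like the following. The wrapper class PageXmlExport and its Export method are made up for illustration, and the choice of word extractor, page segmenter and reading order detector is only one possible combination. As far as I can tell, this export relies on layout analysis of the page rather than on the tags, so treat its reading order as a detected one. Constructor parameters may also differ between PdfPig versions.

```csharp
using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis.Export;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;
using UglyToad.PdfPig.Util;

// Hypothetical wrapper, not part of PdfPig.
public static class PageXmlExport
{
    // Returns one PAGE XML string per page of the document.
    public static List<string> Export(string pathToFile)
    {
        // Assumption: these three components are a reasonable default choice
        var exporter = new PageXmlTextExporter(
            DefaultWordExtractor.Instance,
            RecursiveXYCut.Instance,
            UnsupervisedReadingOrderDetector.Instance);

        var xmlPerPage = new List<string>();
        using (PdfDocument document = PdfDocument.Open(pathToFile))
        {
            foreach (Page page in document.GetPages())
            {
                xmlPerPage.Add(exporter.Get(page));
            }
        }

        return xmlPerPage;
    }
}
```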

Note: I am an active contributor to this library.