Tags: java, c#, pdf, accessibility, acrobat

Extract Reading Order Sequence in a Tagged PDF


I'm currently validating the correct order of the content in a Tagged PDF File.

Is there any way to extract the reading order numbers of Tagged PDF Files programmatically?

I've tried converting the tagged PDF to XML, but I can't figure out which tag belongs to which piece of text.

I've tried the following Libraries:

  • Syncfusion
  • iText 7

but I can't find any method in either that returns the reading order numbers.

Is this actually possible? Thanks in advance!


Solution

  • You can extract the marked content tree of a tagged PDF using the PdfPig (.NET) library. My understanding is that the reading order is indicated by the marked-content identifier (MCID).

    If a marked content element does not contain an MCID (pagination elements, for example), PdfPig reports its MCID as -1.

    Each MarkedContentElement will contain the letters, images and paths that belong to it:

            using System.Linq; // needed for OrderBy
            using UglyToad.PdfPig;
            [...]
    
            using (PdfDocument document = PdfDocument.Open(pathToFile))
            {
                for (int p = 0; p < document.NumberOfPages; p++)
                {
                    var page = document.GetPage(p + 1);
    
                    // extract the page's marked content
                    var markedContents = page.GetMarkedContents(); 
    
                    var orderedMarkedContents = markedContents
                           .OrderBy(mc => mc.MarkedContentIdentifier);
    
                    foreach (var mc in orderedMarkedContents)
                    {
                        // do something
                    }
                }
            }
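
    As a rough sketch of what the "do something" step could look like, the program below prints each element's MCID, tag and text. It follows the assumption above that MCID order reflects reading order, and it skips elements whose MCID is -1. The class name MarkedContentDump and the command-line argument are made up for illustration; the Tag, Letters and Value properties are used as I understand the current PdfPig API, so check them against your version.

    ```csharp
    using System;
    using System.Linq;
    using UglyToad.PdfPig;

    // Hypothetical console program, for illustration only.
    class MarkedContentDump
    {
        static void Main(string[] args)
        {
            // Assumption: the path to a tagged PDF is passed as the first argument.
            string pathToFile = args[0];

            using (PdfDocument document = PdfDocument.Open(pathToFile))
            {
                foreach (var page in document.GetPages())
                {
                    var ordered = page.GetMarkedContents()
                        // drop elements without an MCID (reported as -1)
                        .Where(mc => mc.MarkedContentIdentifier >= 0)
                        .OrderBy(mc => mc.MarkedContentIdentifier);

                    foreach (var mc in ordered)
                    {
                        // Concatenate the glyphs that belong directly to this element.
                        string text = string.Concat(mc.Letters.Select(l => l.Value));

                        Console.WriteLine(
                            $"page {page.Number}  MCID {mc.MarkedContentIdentifier}  <{mc.Tag}>  {text}");
                    }
                }
            }
        }
    }
    ```

    Only the top-level elements returned for each page are visited here; nested elements (mc.Children) are not traversed. The text is a plain concatenation of glyph values, so words will run together wherever the PDF has no explicit space glyphs.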
    

    If you want to export the result to XML, have a look at the PageXmlTextExporter class. The wiki has more information on ITextExporter and IReadingOrderDetector.

    Note: I am an active contributor to this library.