Extracting embedded XML file from PDF/A-3 using abcpdf in C# - ZUGFeRD

I'm currently working with the new German ZUGFeRD files. These are PDF/A-3 files that have an XML file embedded in them, which contains the invoice data.

I want to extract this XML file from the PDF/A-3 using abcpdf 8.1 with C#.

Any idea how to do this?

Thanks a lot and regards,

Solution

I don't know abcpdf, but I'd guess that most PDF libraries offer similar access to a PDF's content.

First, take a look at Das-ZUGFeRD-Format_1p0.pdf, especially page 112. The image there shows the object tree you have to traverse in order to find the XML stream.

The tree gives you the names, the types and the direction of the references. With that you can traverse the PDF object tree to get to the XML content that you are looking for.

The steps, based on the diagram:

Read your PDF
Get the catalog inside your PDF
Get the array named AF from the catalog
Get the first element of the AF array (it should be a file specification)
From the file specification, get the dictionary named EF
From EF, get the embedded file stream (its F entry) and read the stream content

These are the steps you need to perform in order to get to the content.
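
Since I can't vouch for the abcpdf calls, here is only a library-neutral sketch of that traversal, written in Python for brevity. Plain dicts and lists stand in for the PDF objects, so nothing here is abcpdf API: the function name, the toy catalog, and the way a stream is modelled (a dict with "Filter" and "data") are all assumptions made for illustration. The key names AF, EF, F and UF are the ones you would look up with whatever library you use.

```python
import zlib


def extract_embedded_xml(catalog):
    """Walk Catalog -> AF[0] -> EF -> F and return the decoded file bytes.

    `catalog` is a toy stand-in for the PDF catalog: nested dicts and lists
    instead of real PDF objects (an assumption of this sketch).
    """
    af = catalog.get("AF")
    if not af:
        raise ValueError("catalog has no AF array")

    file_spec = af[0]  # first associated file, should be a file specification
    ef = file_spec.get("EF")
    if ef is None:
        raise ValueError("file specification has no EF dictionary")

    # EF is itself a dictionary; its F (or UF) entry is the embedded file stream.
    stream = ef.get("F") or ef.get("UF")
    if stream is None:
        raise ValueError("EF dictionary has no F or UF entry")

    data = stream["data"]
    # Embedded streams are usually compressed; a real library typically
    # gives you a way to read the already-decoded content.
    if stream.get("Filter") == "FlateDecode":
        data = zlib.decompress(data)
    return data


# Toy object tree mirroring the structure in the diagram (placeholder content).
toy_catalog = {
    "Type": "Catalog",
    "AF": [
        {
            "Type": "Filespec",
            "F": "ZUGFeRD-invoice.xml",
            "EF": {
                "F": {
                    "Filter": "FlateDecode",
                    "data": zlib.compress(b"<invoice/>"),
                }
            },
        }
    ],
}

if __name__ == "__main__":
    print(extract_embedded_xml(toy_catalog).decode("utf-8"))
```

With a real library the shape stays the same: resolve each indirect reference as you go, and ask the library for the decompressed stream data in the last step.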

To display the structure of a PDF and browse the tree, I would recommend a tool like iText RUPS.