I have a one-page document, and when I try to apply the PDF_A_2A conformance level I get the exception below:
iText.Kernel.Exceptions.PdfException: Unexpected end of file.
at iText.Kernel.Pdf.Canvas.Parser.Util.PdfCanvasParser.ReadDictionary()
at iText.Kernel.Pdf.Canvas.Parser.Util.PdfCanvasParser.Parse(IList`1 ls)
at iText.Pdfa.Checker.PdfA1Checker.CheckContentStream(PdfStream contentStream)
at iText.Pdfa.Checker.PdfAChecker.CheckPage(PdfPage page)
at iText.Pdfa.Checker.PdfAChecker.CheckPages(PdfDocument document)
at iText.Pdfa.Checker.PdfAChecker.CheckDocument(PdfCatalog catalog)
at iText.Kernel.Utils.ValidationContainer.Validate(ValidationContext context)
at iText.Kernel.Pdf.PdfDocument.Close()
I'm using this piece of code to go through the original file and copy the pages one by one to the new PdfADocument:
using (PdfADocument pdfADoc = new PdfADocument(writer, pdfAConformanceLevel, CreateOutputIntent(false)))
{
    pdfADoc.GetDocumentInfo().SetCreator("Scan2x");
    using (PdfDocument existingDoc = new PdfDocument(pdfReader))
    {
        int numberOfPages = existingDoc.GetNumberOfPages();
        for (int i = 1; i <= numberOfPages; i++)
        {
            PdfPage page = existingDoc.GetPage(i);
            //PdfDictionary pageDictionary = page.GetPdfObject();
            pdfADoc.AddPage(page.CopyTo(pdfADoc));
        }
        //existingDoc.CopyPagesTo(1, existingDoc.GetNumberOfPages(), pdfADoc);
        pdfADoc.SetTagged();
        pdfADoc.GetCatalog().SetLang(new PdfString("en-US"));
        pdfADoc.GetCatalog().SetViewerPreferences(new PdfViewerPreferences().SetDisplayDocTitle(true));
    }
}
private static PdfOutputIntent CreateOutputIntent(bool hasCmykOutputIntent)
{
    string exeDirectory = AppDomain.CurrentDomain.BaseDirectory;
    string colourProfilePath = "";
    if (hasCmykOutputIntent)
        colourProfilePath = Path.Combine(exeDirectory, "CMYK.icc");
    else
        colourProfilePath = Path.Combine(exeDirectory, "sRGBColor.icm");
    Stream iccProfileStream = File.OpenRead(colourProfilePath);
    return new PdfOutputIntent("Custom", "", "http://www.color.org", hasCmykOutputIntent ? "CMYK" : "sRGB IEC61966-2.1", iccProfileStream);
}
My page dictionary is
<</Contents [114 0 R 115 0 R 116 0 R 117 0 R 118 0 R 119 0 R 120 0 R 121 0 R ] /CropBox [0 0 594.96 842.04 ] /Group <</CS /DeviceRGB /S /Transparency /Type /Group >> /MediaBox [0 0 594.96 842.04 ] /Parent 108 0 R /Resources <</ColorSpace <</CS0 132 0 R >> /Font <</C2_0 137 0 R /C2_1 143 0 R /TT0 146 0 R /TT1 149 0 R >> /ProcSet [/PDF /Text /ImageC /ImageI ] /XObject <</Im0 129 0 R >> >> /Rotate 0 /StructParents 0 /Tabs /S /Type /Page >>
First of all, your code does not do what you hope it does: you appear to expect that when you add pages from some arbitrary PdfDocument to a PdfADocument, these classes ensure that the result is PDF/A compliant. This is not the case.
PdfADocument helps you create PDF/A compliant documents. In particular it adds the claim that the document is PDF/A compliant and executes a number of tests checking some PDF/A requirements. But it doesn't do magic. It doesn't make imported PDF contents PDF/A compliant.
Furthermore, by copying pages from other documents you can even lose important data. For example, the existing structure tree data gets lost.
Nonetheless, with your example file you have found a deficiency in the iText routines checking for PDF/A compliance.
When the document is closed, one of these test routines visits all the page content streams, and it stumbles over a rare structure copied from your example: the contents of the page are spread over multiple streams, and in one case the cut between two streams falls inside a marked content dictionary.
Object 120:
...
q
435.94 185.18 90.984 14.64 re
W* n
/Span <</Lang
Object 121:
(en-US)/MCID 43 >>BDC
EMC
Q
...
This is something the test routine cannot handle: PdfAChecker.CheckPage iterates over all content streams of the page it checks and calls PdfA1Checker.CheckContentStream for each stream separately. In that method the current content stream is parsed on its own and each object in it is checked. Unfortunately the parser expects dictionaries to be complete within a single stream. Thus, it stumbles over the dictionary start <</Lang at the end of object 120 and throws
iText.Kernel.Exceptions.PdfException: Unexpected end of file.
at iText.Kernel.Pdf.Canvas.Parser.Util.PdfCanvasParser.ReadDictionary()
at iText.Kernel.Pdf.Canvas.Parser.Util.PdfCanvasParser.Parse(IList`1 ls)
at iText.Pdfa.Checker.PdfA1Checker.CheckContentStream(PdfStream contentStream)
at iText.Pdfa.Checker.PdfAChecker.CheckPage(PdfPage page)
as you have observed.
Most likely this rarely causes problems in practice, because content streams generated by iText itself don't get split like that.
To avoid running into that specific issue, you can preprocess the document you want to import pages from by merging the content streams of each page into a single one: retrieve the concatenated contents using PdfPage.GetContentBytes(), remove all existing content streams from the page, create a new content stream, and add the concatenated contents to that new stream.
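A minimal sketch of that preprocessing step follows. It assumes iText 7 for .NET; the class name ContentStreamMerger and the method name MergeContentStreams are made up for illustration, and the code has not been run against your file. The source is opened with both a reader and a writer because new indirect objects can only be created in a document that has a writer. Note that this only works around the parser problem; it does not make the imported content PDF/A compliant.

```csharp
using System.IO;
using iText.Kernel.Pdf;

public static class ContentStreamMerger
{
    // Hypothetical helper: returns a copy of the source PDF in which every
    // page has exactly one content stream.
    public static byte[] MergeContentStreams(string sourcePath)
    {
        using (MemoryStream output = new MemoryStream())
        {
            // Reader + writer, so that new indirect objects can be created.
            using (PdfDocument doc = new PdfDocument(new PdfReader(sourcePath), new PdfWriter(output)))
            {
                for (int i = 1; i <= doc.GetNumberOfPages(); i++)
                {
                    PdfPage page = doc.GetPage(i);
                    if (page.GetContentStreamCount() <= 1)
                        continue; // nothing to merge on this page

                    // Decoded bytes of all content streams of the page, concatenated.
                    byte[] merged = page.GetContentBytes();

                    // Put the merged bytes into one new stream and let /Contents
                    // point at it, replacing the old array of streams.
                    PdfStream single = new PdfStream(merged);
                    single.MakeIndirect(doc);
                    page.GetPdfObject().Put(PdfName.Contents, single);
                }
            }
            // ToArray() still works after the writer has closed the stream.
            return output.ToArray();
        }
    }
}
```

You would then build the pdfReader used in your copy loop from the returned bytes, e.g. new PdfReader(new MemoryStream(mergedBytes)), instead of from the original file.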