I'm trying to read the appearance stream of a PDF annotation, using iTextSharp, and get the content text from the stream.
I'm using the following code:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public String ExtractAnnotationText(PdfStream xObject)
{
    // Resource dictionary of the form XObject (fonts etc.); null if /Resources is absent.
    PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES);
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy);
    // Decoded content stream bytes of the appearance stream.
    byte[] contentByteArray = ContentByteUtils.GetContentBytesFromContentObject(xObject);
    processor.ProcessContent(contentByteArray, resources);
    return strategy.GetResultantText();
}
The xObject argument is retrieved from the appearance dictionary and passed in like this:
PRStream value = (PRStream)appearancesDictionary.GetAsStream(key);
String text = ExtractAnnotationText(value);
This generally works well for getting the appearance text from annotations, but I found an example of a FreeTextCallout where xObject doesn't have a /Resources key, as shown by its hashMap:
[/Type, /XObject]
[/Subtype, /Form]
[/FormType, 1]
[/Length, 71]
[/Matrix, [1, 0, 0, 1, -28.7103, -643.893]]
[/BBox, [28.7103, 643.893, 597.85, 751.068]]
[/Filter, /FlateDecode]
In this case, is there another way to construct a Resources dictionary for passing to PdfContentStreamProcessor.ProcessContent()? Or even a different way to get the text without using ProcessContent()?
On this, the PDF specification states:
A resource dictionary shall be associated with a content stream in one of the following ways:
For a content stream that is the value of a page’s Contents entry (or is an element of an array that is the value of that entry), the resource dictionary shall be designated by the page dictionary’s Resources or is inherited, as described under 7.7.3.4, "Inheritance of Page Attributes," from some ancestor node of the page object.
For other content streams, a conforming writer shall include a Resources entry in the stream's dictionary specifying the resource dictionary which contains all the resources used by that content stream. This shall apply to content streams that define form XObjects, patterns, Type 3 fonts, and annotation.
PDF files written obeying earlier versions of PDF may have omitted the Resources entry in all form XObjects and Type 3 fonts used on a page. All resources that are referenced from those forms and fonts shall be inherited from the resource dictionary of the page on which they are used. This construct is obsolete and should not be used by conforming writers.
(section 7.8.3 - Resource Dictionaries - of ISO 32000-1)
Thus, your example is either a case of that third option (the resources are inherited from the page on which the annotation is used), or its content stream simply needs no resources at all, or the file is simply broken.
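If the file falls under the third option, one possible approach is to fall back to the resource dictionary of the page that carries the annotation. The following is only a sketch under assumptions, not a tested solution: the class name AnnotationTextExtractor and the extra pageDictionary parameter are hypothetical additions, and the caller is assumed to obtain the page dictionary itself, for example via reader.GetPageN(pageNumber).

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class AnnotationTextExtractor
{
    // pageDictionary (hypothetical parameter) is assumed to be the dictionary
    // of the page on which the annotation is used.
    public static string ExtractAnnotationText(PdfStream xObject, PdfDictionary pageDictionary)
    {
        // Preferred case: the form XObject carries its own /Resources entry.
        PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES);

        // Obsolete construct from the specification: inherit from the page.
        if (resources == null && pageDictionary != null)
        {
            resources = pageDictionary.GetAsDict(PdfName.RESOURCES);
        }

        // Last resort: an empty dictionary, which only helps if the
        // content stream references no named resources at all.
        if (resources == null)
        {
            resources = new PdfDictionary();
        }

        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy);
        byte[] contentBytes = ContentByteUtils.GetContentBytesFromContentObject(xObject);
        processor.ProcessContent(contentBytes, resources);
        return strategy.GetResultantText();
    }
}
```

With the empty-dictionary fallback, a stream that does select a font by name will most likely still fail during processing, since the font cannot be looked up. A stream of only 71 compressed bytes may well contain nothing but path drawing operators, in which case there is no text to extract from it in the first place.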