Slow access to AcroFields (iTextSharp)

I'm using iTextSharp to extract the signature names from a PDF. Accessing AcroFields is excessively slow for large PDFs with many pages (~40 MB and ~5000 pages).

Here is my code snippet:

using System.Collections.Generic;
using iTextSharp.text.pdf;

private static List<byte[]> GetSignsFromPDF(string filePath)
{
    var result = new List<byte[]>();
    var randomAccessFileOrArray = new RandomAccessFileOrArray(filePath);
    var reader = new PdfReader(randomAccessFileOrArray, null);
    var fields = reader.AcroFields;

    if (fields == null)
    {
        return result;
    }

    var signatureNames = fields.GetSignatureNames();
    signatureNames.Sort();

    foreach (string name in signatureNames)
    {
        var sigDict = fields.GetSignatureDictionary(name);
        var contents = sigDict.GetAsString(PdfName.CONTENTS);

        if (contents != null)
        {
            result.Add(contents.GetOriginalBytes());
        }
    }

    return result;
}

Is there a smarter/faster way to access AcroFields, or do I just have to wait for iTextSharp to do its work?

Thanks a lot.

Solution

In the comments, the conjecture came up that the excessive slowness arises because iText(Sharp), when initializing the field collection of an AcroFields instance, inspects not only the fields referenced in Catalog -> AcroForm -> Fields but also (actually foremost) the Annots arrays of all document pages.

Fortunately, this initialization does not take place in the AcroFields constructor, so we can inject a field collection retrieved without inspecting all the pages.

The following method is a copy of the internal AcroFields method Fill (which is responsible for the lazy initialization) with the page traversal removed and with access to hidden members enabled via reflection. It can be used to test the conjecture.

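// Requires: using System; using System.Collections.Generic;
//           using System.util.collections; using iTextSharp.text.pdf;
void fill(PdfReader reader, AcroFields acroFields)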
{
    IDictionary<string, AcroFields.Item> fields = new LinkedDictionary<string, AcroFields.Item>();
    PdfDictionary top = (PdfDictionary)PdfReader.GetPdfObjectRelease(reader.Catalog.Get(PdfName.ACROFORM));
    if (top == null)
        return;
    PdfBoolean needappearances = top.GetAsBoolean(PdfName.NEEDAPPEARANCES);
    if (needappearances == null || !needappearances.BooleanValue)
        acroFields.GenerateAppearances = true;
    else
        acroFields.GenerateAppearances = false;
    PdfArray arrfds = (PdfArray)PdfReader.GetPdfObjectRelease(top.Get(PdfName.FIELDS));
    if (arrfds == null || arrfds.Size == 0)
        return;

    System.Reflection.FieldInfo valuesField = typeof(AcroFields.Item).GetField("values", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    System.Reflection.FieldInfo widgetsField = typeof(AcroFields.Item).GetField("widgets", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    System.Reflection.FieldInfo widgetRefsField = typeof(AcroFields.Item).GetField("widget_refs", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    System.Reflection.FieldInfo mergedField = typeof(AcroFields.Item).GetField("merged", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    System.Reflection.FieldInfo pageField = typeof(AcroFields.Item).GetField("page", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    System.Reflection.FieldInfo tabOrderField = typeof(AcroFields.Item).GetField("tabOrder", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);

    for (int j = 0; j < arrfds.Size; ++j)
    {
        PdfDictionary annot = arrfds.GetAsDict(j);
        if (annot == null)
        {
            PdfReader.ReleaseLastXrefPartial(arrfds.GetAsIndirectObject(j));
            continue;
        }
        if (!PdfName.WIDGET.Equals(annot.GetAsName(PdfName.SUBTYPE)))
        {
            PdfReader.ReleaseLastXrefPartial(arrfds.GetAsIndirectObject(j));
            continue;
        }
        PdfArray kids = (PdfArray)PdfReader.GetPdfObjectRelease(annot.Get(PdfName.KIDS));
        if (kids != null)
            continue;
        PdfDictionary dic = new PdfDictionary();
        dic.Merge(annot);
        PdfString t = annot.GetAsString(PdfName.T);
        if (t == null)
            continue;
        String name = t.ToUnicodeString();
        if (fields.ContainsKey(name))
            continue;
        AcroFields.Item item = new AcroFields.Item();
        fields[name] = item;
        ((List<PdfDictionary>)valuesField.GetValue(item)).Add(dic); // item.AddValue(dic);
        ((List<PdfDictionary>)widgetsField.GetValue(item)).Add(dic); // item.AddWidget(dic);
        ((List<PdfIndirectReference>)widgetRefsField.GetValue(item)).Add(arrfds.GetAsIndirectObject(j)); //item.AddWidgetRef(arrfds.GetAsIndirectObject(j)); // must be a reference
        ((List<PdfDictionary>)mergedField.GetValue(item)).Add(dic); // item.AddMerged(dic);
        ((List<int>)pageField.GetValue(item)).Add(-1); // item.AddPage(-1);
        ((List<int>)tabOrderField.GetValue(item)).Add(-1); // item.AddTabOrder(-1);
    }

    System.Reflection.FieldInfo fieldsField = typeof(AcroFields).GetField("fields", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    fieldsField.SetValue(acroFields, fields);
}

It should be called for an AcroFields instance as early as possible, e.g.:

using (PdfReader reader = new PdfReader(file))
{
    AcroFields acroFields = reader.AcroFields;
    fill(reader, acroFields);
    ...

If using this method reduces the time considerably (while at the same time supplying the desired fields), the conjecture is confirmed.
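To make the comparison concrete, one could time the signature name lookup once with the default lazy initialization and once with the fill method above. The following is only a rough sketch: the class ConjectureTimer and its method TimeSignatureNames are hypothetical helper names, not part of iTextSharp, and a single run on one file is no proper benchmark.

```csharp
using System;
using System.Diagnostics;
using iTextSharp.text.pdf;

static class ConjectureTimer
{
    // Hypothetical helper: measures opening the file and listing the signature names.
    // prepare may be null (default lazy initialization) or a method like fill above.
    public static TimeSpan TimeSignatureNames(string file, Action<PdfReader, AcroFields> prepare, out int count)
    {
        Stopwatch watch = Stopwatch.StartNew();
        PdfReader reader = new PdfReader(file);
        try
        {
            AcroFields acroFields = reader.AcroFields;
            if (prepare != null)
                prepare(reader, acroFields); // inject the field collection before first use
            count = acroFields.GetSignatureNames().Count;
        }
        finally
        {
            reader.Close();
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Calling it once with null and once with fill for the same file should show whether the page traversal dominates; the two counts should match if all signature fields are first-level entries of the Fields array.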

Looking at the code, one recognizes that it does not properly walk the field structure: fields may be arranged hierarchically, but the code only considers the first-level elements. It should suffice for a first test of the conjecture mentioned above, though.
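If the test confirms the conjecture, the hierarchy could be walked recursively via the Kids arrays. The following is a minimal sketch of such a walk only, not a drop-in replacement for fill: the class FieldTreeWalker and its methods are hypothetical names, it merely collects the terminal field dictionaries by their fully qualified names, and it assumes that a kid without a T entry is a pure widget of its parent. Inherited attributes (e.g. an FT set on a parent) are not merged here.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

static class FieldTreeWalker
{
    // Collects terminal fields reachable from Catalog -> AcroForm -> Fields.
    public static IDictionary<string, PdfDictionary> CollectTerminalFields(PdfReader reader)
    {
        IDictionary<string, PdfDictionary> result = new Dictionary<string, PdfDictionary>();
        PdfDictionary acroForm = reader.Catalog.GetAsDict(PdfName.ACROFORM);
        PdfArray fields = acroForm == null ? null : acroForm.GetAsArray(PdfName.FIELDS);
        if (fields != null)
            Walk(fields, "", result, 0);
        return result;
    }

    public static void Walk(PdfArray nodes, string parentName, IDictionary<string, PdfDictionary> result, int depth)
    {
        if (depth > 32)
            return; // arbitrary guard against cyclic Kids references
        for (int i = 0; i < nodes.Size; ++i)
        {
            PdfDictionary node = nodes.GetAsDict(i);
            if (node == null)
                continue;
            PdfString t = node.GetAsString(PdfName.T);
            if (t == null)
                continue; // no partial name: assumed to be a pure widget
            string name = parentName.Length == 0
                ? t.ToUnicodeString()
                : parentName + "." + t.ToUnicodeString();

            // A node is treated as non-terminal if any kid carries its own partial name.
            PdfArray kids = node.GetAsArray(PdfName.KIDS);
            bool hasChildFields = false;
            for (int k = 0; kids != null && k < kids.Size; ++k)
            {
                PdfDictionary kid = kids.GetAsDict(k);
                if (kid != null && kid.Contains(PdfName.T))
                {
                    hasChildFields = true;
                    break;
                }
            }

            if (hasChildFields)
                Walk(kids, name, result, depth + 1);
            else if (!result.ContainsKey(name))
                result[name] = node;
        }
    }
}
```

The dictionaries collected this way would still have to be turned into AcroFields.Item instances as in fill before they can be injected.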