C#: How to cast from 'iTextSharp.text.pdf.PdfArray' to 'iTextSharp.text.pdf.PRIndirectReference'


I had been using this piece of code until today, and it was working fine:

for (int page = 1; page <= reader.NumberOfPages; page++)
{
    var cpage = reader.GetPageN(page);
    var content = cpage.Get(PdfName.CONTENTS);

    var ir = (PRIndirectReference)content;

    var value = reader.GetPdfObject(ir.Number);

    if (value.IsStream())
    {
        PRStream stream = (PRStream)value;

        var streamBytes = PdfReader.GetStreamBytes(stream);

        var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));

        try
        {
            while (tokenizer.NextToken())
            {
                if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                {
                    string strs = tokenizer.StringValue;

                    if (!(br = excludeList.Any(st => strs.Contains(st))))
                    {
                        //strfor += tokenizer.StringValue;

                        if (!string.IsNullOrWhiteSpace(strs) &&
                            !stringsList.Any(i => i == strs && excludeHeaders.Contains(strs)))
                            stringsList.Add(strs);
                    }
                }
            }
        }
        finally
        {
            tokenizer.Close();
        }
    }
}

But today I got an exception for one PDF file: Unable to cast object of type 'iTextSharp.text.pdf.PdfArray' to type 'iTextSharp.text.pdf.PRIndirectReference'.

While debugging, I found that the error is at this line: var ir = (PRIndirectReference)content;. For this PDF, the content entry of the page comes back as a PdfArray (which the debugger shows as an ArrayList of items) rather than a single indirect reference, so the cast fails.

I would be really grateful if anyone could help me with this. Thanks in advance.

EDIT:

The PDF contents are paragraphs, tables, headers and footers, and, in a few cases, images. I'm not concerned about the images, as I'm skipping them.

As you can see from the code, I'm trying to add the words to a string list, so I expect the output to be plain text: words, to be specific.


Solution

  • That was really easy! I don't know why I couldn't figure it out.

    // Requires: using System.Collections.Generic; using System.Text; using iTextSharp.text.pdf;
    PdfReader reader = new PdfReader(name); // name: path of the PDF file
    List<string> stringsList = new List<string>();
    
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // Get the page content directly as bytes; this works whether the
        // page's content entry is a single stream or an array of streams
        var streamByte = reader.GetPageContent(page);
        var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamByte));
        var sb = new StringBuilder(); // use a string builder instead
    
        try
        {
            while (tokenizer.NextToken())
            {
                if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                {
                    var currentText = tokenizer.StringValue;
                    // Round-trip through the system default code page to UTF-8
                    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                    // Append the converted text, not the raw token value
                    sb.Append(currentText);
                }
            }
        }
        finally
        {
            // Add the text collected for this page to the string list
            stringsList.Add(sb.ToString());
    
            tokenizer.Close();
        }
    }
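
    If you would rather keep the dictionary-based approach from the question, the cast can be avoided by resolving the content entry first and then checking whether it is a stream or an array. The following is only a sketch, assuming iTextSharp 5.x; PageContentHelper and GetContentStreams are hypothetical names made up for illustration, not part of the library. For most cases reader.GetPageContent(page) above is simpler, since it concatenates the streams for you.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
static class PageContentHelper
{
    // Returns the decoded bytes of every content stream on a page,
    // whether the page's /Contents entry is one stream or an array of streams.
    public static List<byte[]> GetContentStreams(PdfReader reader, int pageNumber)
    {
        var result = new List<byte[]>();
        PdfDictionary pageDict = reader.GetPageN(pageNumber);

        // PdfReader.GetPdfObject resolves an indirect reference to the
        // object it points at; direct objects are returned unchanged.
        PdfObject content = PdfReader.GetPdfObject(pageDict.Get(PdfName.CONTENTS));
        if (content == null)
            return result; // page has no content at all

        if (content.IsStream())
        {
            // Single content stream: the case the original cast assumed.
            result.Add(PdfReader.GetStreamBytes((PRStream)content));
        }
        else if (content.IsArray())
        {
            // Array of content streams: the case that raised the exception.
            PdfArray array = (PdfArray)content;
            for (int i = 0; i < array.Size; i++)
            {
                // Each element is usually an indirect reference; resolve it.
                PdfObject item = array.GetDirectObject(i);
                if (item != null && item.IsStream())
                    result.Add(PdfReader.GetStreamBytes((PRStream)item));
            }
        }

        return result;
    }
}
```

    Each returned byte array can then be passed to new PRTokeniser(new RandomAccessFileOrArray(bytes)) exactly as in the question's loop.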