PdfTextExtractor.GetTextFromPage() returns empty string


I'm trying to extract the text from the following PDF with this code (using iText7 7.2.2):

var source = (string)GetHttpResult("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf", new CookieContainer());
var bytes = Encoding.UTF8.GetBytes(source);
var stream = new MemoryStream(bytes);
var reader = new PdfReader(stream);
var doc = new PdfDocument(reader);
var pages = doc.GetNumberOfPages();
var text = PdfTextExtractor.GetTextFromPage(doc.GetPage(1));

Loading the PDF in my browser (Edge 100.0) works fine.

GetHttpResult() is a simple HttpClient wrapper that sets a custom CookieContainer and a custom UserAgent, then calls ReadAsStringAsync(). Nothing fancy.

source has the correct PDF content, starting with "%PDF-1.7".

pages has the correct number of pages, which is 2.

But, whatever I try, text is always empty.

Defining an explicit TextExtractionStrategy, trying different Encodings, extracting from all pages in a loop, and so on: none of it makes a difference. text is always empty, and no Exception is thrown anywhere.

I suspect I'm not reading this PDF the way it's "meant" to be read. But what is the correct way, given that source has the correct content, the page count is correct, and no Exception is thrown anywhere?

Thanks.


Solution

  • That's it! Thanks to mkl and KJ!

    I first downloaded the PDF as a byte array, to be sure it isn't modified in any way.

    Then, since pdftotext is able to extract the text from this PDF, I searched for a NuGet package able to do the same. I tested almost ten of them, and FreeSpire.PDF finally did it!

    Update: FreeSpire.PDF actually missed some words, so I kept looking and found PdfPig, which extracts every single word.

    Code using PdfPig:

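    using System.Collections.Generic;
    using System.Net.Http;
    using UglyToad.PdfPig;
    using UglyToad.PdfPig.Content;
    
    // Download the PDF as raw bytes so the binary content is left untouched
    byte[] bytes;
    using (HttpClient client = new())
    {
        bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
    }
    
    // Collect every word of every page, in the order PdfPig returns them
    List<string> words = new();
    using (PdfDocument document = PdfDocument.Open(bytes))
    {
        foreach (Page page in document.GetPages())
        {
            foreach (Word word in page.GetWords())
            {
                words.Add(word.Text);
            }
        }
    }
    
    // Join the words into a single space-separated string
    string text = string.Join(" ", words);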
    

    Code using FreeSpire.PDF:

    using System.Net.Http;
    using Spire.Pdf;
    using Spire.Pdf.Exporting.Text;
    
    // Download the PDF as raw bytes so the binary content is left untouched
    byte[] bytes;
    using (HttpClient client = new())
    {
        bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
    }
    
    // Extract the text page by page and append it to a single string
    string text = string.Empty;
    SimpleTextExtractionStrategy strategy = new();
    using (PdfDocument doc = new())
    {
        doc.LoadFromBytes(bytes);
        foreach (PdfPageBase page in doc.Pages)
        {
            text += page.ExtractText(strategy);
        }
    }
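
As background for the byte-array download at the start of this answer: a PDF is a binary format, and decoding it to a string and then re-encoding it with Encoding.UTF8.GetBytes() is generally not lossless, even when the text header survives. The sketch below illustrates this with a made-up 15-byte sample: a "%PDF-1.7" header followed by a binary comment line of the kind PDF writers commonly emit. The sample bytes are an assumption for illustration and are not taken from the actual file. It also assumes the string was decoded as UTF-8, which in practice depends on the response headers.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical sample, not the real file: "%PDF-1.7\n" followed by a
// comment line containing four bytes >= 0x80 that are not valid UTF-8.
byte[] original =
{
    0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E, 0x37, 0x0A, // %PDF-1.7\n
    0x25, 0xE2, 0xE3, 0xCF, 0xD3, 0x0A                    // %<binary>\n
};

// Roughly what reading the body as a string and calling
// Encoding.UTF8.GetBytes() on it amounts to (assuming UTF-8 decoding).
string asText = Encoding.UTF8.GetString(original);
byte[] roundTripped = Encoding.UTF8.GetBytes(asText);

// The ASCII header survives, so the string looks like a valid PDF...
Console.WriteLine(asText.StartsWith("%PDF-1.7", StringComparison.Ordinal)); // True

// ...but the invalid byte sequences were replaced during decoding,
// so the re-encoded bytes no longer match the original ones.
Console.WriteLine(original.SequenceEqual(roundTripped)); // False
```

This is consistent with source starting with "%PDF-1.7" while the binary parts of the file are damaged, though it does not by itself prove that the round trip was the only reason iText returned no text. Downloading with GetByteArrayAsync(), as in the code above, avoids the round trip entirely.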