I'm trying to extract the text from the following PDF with the code below (using iText7 7.2.2):
var source = (string)GetHttpResult("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf", new CookieContainer());
var bytes = Encoding.UTF8.GetBytes(source);
var stream = new MemoryStream(bytes);
var reader = new PdfReader(stream);
var doc = new PdfDocument(reader);
var pages = doc.GetNumberOfPages();
var text = PdfTextExtractor.GetTextFromPage(doc.GetPage(1));
Loading the PDF in my browser (Edge 100.0) works fine.
GetHttpResult() is a simple HttpClient wrapper that sets a custom CookieContainer and a custom UserAgent, then calls ReadAsStringAsync(). Nothing fancy.
source has the correct PDF content, starting with "%PDF-1.7".
pages has the correct number of pages, which is 2.
But whatever I try, text is always empty.
Defining an explicit TextExtractionStrategy, trying different Encodings, extracting from all pages in a loop: nothing helps. text is always empty, and no Exception is thrown anywhere.
I suspect I'm not reading this PDF the way it's "meant" to be read, but what is the correct way, given that source has the correct content, the page count is correct, and no Exception is thrown?
Thanks.
That's it! Thanks to mkl and KJ!
I first downloaded the PDF as a byte array so I'm sure it's not modified in any way.
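To illustrate why the byte-array download matters, here is a minimal sketch of what a string round trip can do to binary data. The sample bytes are a hypothetical PDF-like header (not taken from the actual file), and the sketch assumes the text is decoded and re-encoded as UTF-8, as in the question's code. The readable header survives, which would explain why source appears correct, while the binary bytes after it do not.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical sample: "%PDF-1.7\n" followed by a binary marker comment,
// similar to what many PDF files contain right after the header.
byte[] original =
{
    0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E, 0x37, 0x0A, // %PDF-1.7\n
    0x25, 0xE2, 0xE3, 0xCF, 0xD3, 0x0A                    // %<binary bytes>\n
};

// Decode as UTF-8 text, then encode back to bytes (the question's approach).
string asText = Encoding.UTF8.GetString(original);
byte[] roundTripped = Encoding.UTF8.GetBytes(asText);

// The ASCII header is intact, so the string "looks" like a valid PDF...
bool headerLooksFine = asText.StartsWith("%PDF-1.7", StringComparison.Ordinal);

// ...but the invalid UTF-8 sequences were replaced during decoding,
// so the re-encoded bytes no longer match the original.
bool bytesSurvived = original.SequenceEqual(roundTripped);

Console.WriteLine($"Header intact: {headerLooksFine}, bytes intact: {bytesSurvived}");
```

Compressed content streams in a real PDF are binary in the same way, so fetching the file with GetByteArrayAsync() avoids this class of damage regardless of which extraction library is used afterwards.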
Then, as pdftotext is able to extract the text from this PDF, I searched for a NuGet package able to do the same. I tested almost ten of them, and FreeSpire.PDF finally did it.
Update: Actually, FreeSpire.PDF missed some words, so I switched to PdfPig, which extracted every single word.
Code using PdfPig:
using System.Collections.Generic;
using System.Net.Http;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Download the PDF as raw bytes so the binary content is not altered.
byte[] bytes;
using (HttpClient client = new())
{
    bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}

// Collect every word from every page.
List<string> words = new();
using (PdfDocument document = PdfDocument.Open(bytes))
{
    foreach (Page page in document.GetPages())
    {
        foreach (Word word in page.GetWords())
        {
            words.Add(word.Text);
        }
    }
}

string text = string.Join(" ", words);
Code using FreeSpire.PDF:
using Spire.Pdf;
using Spire.Pdf.Exporting.Text;

byte[] bytes;
using (HttpClient client = new())
{
    bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}

string text = string.Empty;
SimpleTextExtractionStrategy strategy = new();
using (PdfDocument doc = new())
{
    doc.LoadFromBytes(bytes);
    foreach (PdfPageBase page in doc.Pages)
    {
        text += page.ExtractText(strategy);
    }
}