The PDF files contain quantity, price and sum columns. Different PDFs have different columns. In some PDFs, thousands are separated by spaces, for example:
Description Price Quantity Sum
Soap 1 000.00 2.2 2 200.00
White 3 towel 10.00 2 20.00
How can I get the correct price and sum values? I tried iText 7:
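    using System.IO;
    using System.Text;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Canvas.Parser;
    using iText.Kernel.Pdf.Canvas.Parser.Listener;

    MemoryStream pdfStream = GetPdfContents(); // placeholder: returns the PDF file contents
    StringBuilder processed = new();
    pdfStream.Position = 0;
    using var pdfDocument = new PdfDocument(new PdfReader(pdfStream));
    for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i) {
        var page = pdfDocument.GetPage(i);
        // A new strategy per page: a reused LocationTextExtractionStrategy keeps the text of earlier pages
        var strategy = new LocationTextExtractionStrategy();
        string text = PdfTextExtractor.GetTextFromPage(page, strategy);
        processed.Append(text);
    }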
It returns all words separated by a single space:
Soap 1 000.00 2.2 2 200.00
White 3 towel 10.00 2 20.00
Some rows contain values less than 1000 and some contain values greater than 1000. It looks like it is not possible to get the correct values from the text alone. How can I get the distance between words in a row? If the distance is a single space, those words can be merged into one number.
I am using a .NET 7.0 ASP.NET MVC controller.
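The merging idea described above can be sketched independently of any PDF library. This is only an illustration: the `PositionedWord` type, the `RowMerger` class, the `maxGap` threshold and all coordinates in the demo are hypothetical. In practice the x-coordinates would have to come from the PDF library (for example from a custom text extraction listener in iText 7), and that part is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: one word of a row with its left and right x-coordinates (e.g. in PDF points).
public record PositionedWord(string Text, double Left, double Right);

public static class RowMerger
{
    // Joins neighbouring words into one cell when the horizontal gap between them
    // is at most maxGap; a larger gap starts a new cell.
    public static List<string> MergeCells(IEnumerable<PositionedWord> words, double maxGap)
    {
        var cells = new List<string>();
        double? previousRight = null;
        foreach (var word in words.OrderBy(w => w.Left))
        {
            if (previousRight.HasValue && word.Left - previousRight.Value <= maxGap)
                cells[^1] += " " + word.Text; // small gap: same cell
            else
                cells.Add(word.Text);         // large gap: new cell
            previousRight = word.Right;
        }
        return cells;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Made-up coordinates for the row "Soap 1 000.00 2.2 2 200.00".
        var row = new[]
        {
            new PositionedWord("Soap", 50, 70),
            new PositionedWord("1", 200, 205),
            new PositionedWord("000.00", 208, 235),
            new PositionedWord("2.2", 300, 312),
            new PositionedWord("2", 400, 405),
            new PositionedWord("200.00", 408, 435),
        };
        // Prints: Soap|1 000.00|2.2|2 200.00
        Console.WriteLine(string.Join("|", RowMerger.MergeCells(row, maxGap: 5)));
    }
}
```

A suitable `maxGap` depends on the font size used in the document; roughly the width of one space character is a plausible starting point, but that is a guess to be tuned against real files.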
Update
I tried XpdfNet from the answer but got an exception:
System.IO.FileNotFoundException: Could not find file 'C:\myapp\bin\Debug\net7.0\5db7d64c-e1c5-4e1b-b14f-0162ce029c46.txt'.
File name: 'C:\myapp\bin\Debug\net7.0\5db7d64c-e1c5-4e1b-b14f-0162ce029c46.txt'
at Microsoft.Win32.SafeHandles.SafeFileHandle.CreateFile(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
at System.IO.Strategies.FileStreamHelpers.ChooseStrategyCore(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
at System.IO.StreamReader.ValidateArgsAndOpenPath(String path, Encoding encoding, Int32 bufferSize)
at System.IO.File.ReadAllText(String path, Encoding encoding)
at XpdfNet.XpdfHelper.GetTextResult(XpdfParameter parameter)
at XpdfNet.XpdfHelper.ToText(String pdfFilePath, String arguments)
Without the second argument,
string content = pdfHelper.ToText("C:\\a\\test.pdf");
it works, but produces a single-space-delimited result, just like iText.
I found a package, XpdfNet, that may help:
    [HttpGet("test")]
    public void Test()
    {
        var pdfHelper = new XpdfHelper();
        // The second argument is passed on as extra command-line options; "-table" asks for table layout
        string content = pdfHelper.ToText("E:\\test.pdf", "-table");
        Console.WriteLine(content);
    }
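If the table-mode output keeps the columns apart with runs of spaces, each line could then be split into cells roughly as follows. This is a sketch under that assumption: the `TableLineParser` class, both of its methods and the sample line are hypothetical, and real output may need a different separator rule.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class TableLineParser
{
    // Splits one layout-preserved line into cells. Assumption: columns are separated
    // by two or more whitespace characters, while a single space stays inside a cell.
    public static string[] SplitColumns(string line) =>
        Regex.Split(line.Trim(), @"\s{2,}");

    // Removes the space used as a thousands separator and parses the rest
    // with "." as the decimal separator.
    public static decimal ParseAmount(string cell) =>
        decimal.Parse(cell.Replace(" ", ""), CultureInfo.InvariantCulture);
}

public static class TableDemo
{
    public static void Main()
    {
        // Made-up example of a layout-preserved line.
        string line = "Soap          1 000.00     2.2     2 200.00";
        string[] cells = TableLineParser.SplitColumns(line);
        decimal price = TableLineParser.ParseAmount(cells[1]);
        decimal sum = TableLineParser.ParseAmount(cells[3]);
        Console.WriteLine($"{cells[0]}: price={price}, sum={sum}");
    }
}
```

Descriptions such as "White 3 towel" stay in one cell because their words are separated by single spaces only.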