asp.net-mvc asp.net-core pdf itext itext7

How to get numbers from a PDF if thousands are separated by spaces


The PDF files contain quantity, price, and sum columns. Different PDFs have different columns. In some PDFs, thousands are separated by spaces, for example:

Description       Price   Quantity          Sum
Soap           1 000.00        2.2     2 200.00
White 3 towel     10.00          2        20.00

How can I get the proper price and sum values? I tried iText 7:

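using System.IO;
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// GetPdfFileContents() is a placeholder for however the PDF bytes are loaded.
MemoryStream pdfStream = GetPdfFileContents();
StringBuilder processed = new();
pdfStream.Position = 0;
using var pdfDocument = new PdfDocument(new PdfReader(pdfStream));
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i) {
  var page = pdfDocument.GetPage(i);
  // A strategy accumulates text, so create a fresh one for each page.
  var strategy = new LocationTextExtractionStrategy();
  string text = PdfTextExtractor.GetTextFromPage(page, strategy);
  processed.AppendLine(text);
}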

It returns all words separated by a single space:

Soap 1 000.00 2.2 2 200.00
White 3 towel 10.00 2 20.00

Some rows contain values less than 1000 and some contain values greater than 1000, so it does not look possible to recover the proper values from the text alone. How can I get the distance between words in a row? If the distance is a single space, those words can be merged into one number.
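The merging idea described above can be sketched roughly as follows. This is a minimal sketch, not iText code: `WordBox`, `RowMerger`, and the `maxGap` threshold are hypothetical names introduced for illustration, and it assumes the left and right x-coordinates of each word in a row have already been collected (for example from a custom iText text-extraction listener, which is not shown here).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical word box: the text plus its left/right x-coordinates on the page.
public record WordBox(string Text, float Left, float Right);

public static class RowMerger
{
    // True when the text consists only of digits, dots, or commas.
    private static bool IsNumericPart(string text) =>
        text.Length > 0 && text.All(c => char.IsDigit(c) || c == '.' || c == ',');

    // Groups the words of one row into cells. Words whose horizontal gap is
    // at most maxGap stay in the same cell; numeric parts are joined without
    // a space ("1" + "000.00" -> "1000.00"), other words keep their space.
    public static List<string> MergeCells(IEnumerable<WordBox> row, float maxGap)
    {
        var cells = new List<string>();
        float previousRight = 0f;
        bool previousNumeric = false;

        foreach (var word in row.OrderBy(w => w.Left))
        {
            bool numeric = IsNumericPart(word.Text);
            bool sameCell = cells.Count > 0 && word.Left - previousRight <= maxGap;

            if (!sameCell)
                cells.Add(word.Text);                          // wide gap: new column
            else if (numeric && previousNumeric)
                cells[cells.Count - 1] += word.Text;           // thousands group
            else
                cells[cells.Count - 1] += " " + word.Text;     // multi-word description

            previousRight = word.Right;
            previousNumeric = numeric;
        }
        return cells;
    }
}
```

The threshold would have to be tuned per document, for instance to slightly more than the width of one space in the font being used; the coordinates in any test data are made up.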

I'm using a .NET 7.0 ASP.NET MVC controller.

Update

I tried XpdfNet from the answer but got an exception:

System.IO.FileNotFoundException: Could not find file 'C:\myapp\bin\Debug\net7.0\5db7d64c-e1c5-4e1b-b14f-0162ce029c46.txt'.
File name: 'C:\myapp\bin\Debug\net7.0\5db7d64c-e1c5-4e1b-b14f-0162ce029c46.txt'
   at Microsoft.Win32.SafeHandles.SafeFileHandle.CreateFile(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options)
   at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.Strategies.FileStreamHelpers.ChooseStrategyCore(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.StreamReader.ValidateArgsAndOpenPath(String path, Encoding encoding, Int32 bufferSize)
   at System.IO.File.ReadAllText(String path, Encoding encoding)
   at XpdfNet.XpdfHelper.GetTextResult(XpdfParameter parameter)
   at XpdfNet.XpdfHelper.ToText(String pdfFilePath, String arguments)

Without the second argument,

string content = pdfHelper.ToText("C:\\a\\test.pdf");

it works, but produces a single-space-delimited result, just like iText.


Solution

  • I found a package, XpdfNet, that may help.

            using XpdfNet;

            [HttpGet("test")]
            public void Test()
            {
                var pdfHelper = new XpdfHelper();
                // The second argument ("-table") is passed on to the underlying pdftotext tool.
                string content = pdfHelper.ToText("E:\\test.pdf", "-table");
                Console.WriteLine(content);
            }
    

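Once the extracted text keeps the column layout, as in the sample table in the question, each line can be split into cells and the amounts parsed. The following is a minimal sketch under that assumption: `TableLineParser` and its methods are hypothetical helpers, it assumes columns are separated by at least two spaces while thousands groups use a single space, and it has not been checked against actual XpdfNet output.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class TableLineParser
{
    // Splits one layout-preserved line into cells on runs of two or more
    // whitespace characters, so "1 000.00" stays together as one cell.
    public static string[] SplitCells(string line) =>
        Regex.Split(line.Trim(), @"\s{2,}");

    // Parses a cell such as "1 000.00" by removing the thousands spaces first.
    // Returns false for non-numeric cells such as "White 3 towel".
    public static bool TryParseAmount(string cell, out decimal value) =>
        decimal.TryParse(cell.Replace(" ", ""), NumberStyles.Number,
                         CultureInfo.InvariantCulture, out value);
}
```

For the line "Soap           1 000.00        2.2     2 200.00", SplitCells gives the four cells "Soap", "1 000.00", "2.2", and "2 200.00", and TryParseAmount turns the second cell into the decimal 1000.00. The invariant culture is used so that the dot is always read as the decimal separator regardless of server locale.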