I am trying to read ASCII text from a simple PDF file in C# (.NET 8) using iText7:
StringBuilder processed = new StringBuilder();
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
    var page = pdfDocument.GetPage(i);
    string text = PdfTextExtractor.GetTextFromPage(page, strategy);
    processed.Append(text);
}
This returns garbled text:
����������\n�������������������������\n
Text can be copied from the PDF in the Adobe Acrobat PDF viewer.
The PDF is at
https://wetransfer.com/downloads/e21c2093f9a732287383fc5ca97104cd20240414124039/b3ad1e
How can I read the text from this PDF? Can iText be configured for it, or can some other PDF reading library be used?
This has already been asked in
Reading text from pdf with iText7 + C#, text not recognized
but was not solved.
The PDF file has a bad font table and a secondary problem of overset text (overlapping).
There are two ways to deal with a file like this, one that looks good in Acrobat but misbehaves in other applications, and both amount to reprinting it.
I tried several approaches and found that printing from Acrobat to a Ghostscript PostScript printer worked best, since converting PDF to PDF simply carried the errors over. The PostScript is then converted back to PDF with Ghostscript.
Alternatively, you could use the older command-line method: run Poppler/Xpdf pdftops (or Ghostscript) to convert the PDF to PostScript, then use Ghostscript to convert the PostScript back to PDF. However, the result will still have the overset text.
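Since the question is about C#, here is a rough sketch of how that two-step command-line reprint might be driven from .NET. It assumes pdftops (Poppler or Xpdf) and Ghostscript are installed and on the PATH; the class name PdfReprint, the file names and the default Ghostscript executable name gswin64c are illustrative assumptions, not something tested against this particular file.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class PdfReprint
{
    // Builds the two external commands: PDF -> PostScript, then PostScript -> PDF.
    // gsExe is an assumption: "gswin64c" is the usual 64-bit Windows console binary,
    // on Linux/macOS it is normally just "gs".
    public static List<(string Exe, string[] Args)> BuildSteps(
        string inputPdf, string tempPs, string outputPdf, string gsExe = "gswin64c")
    {
        return new List<(string Exe, string[] Args)>
        {
            ("pdftops", new[] { inputPdf, tempPs }),
            (gsExe, new[]
            {
                "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
                "-sOutputFile=" + outputPdf, tempPs
            }),
        };
    }

    // Runs each step in order and stops at the first failure.
    public static void Run(string inputPdf, string tempPs, string outputPdf)
    {
        foreach (var (exe, args) in BuildSteps(inputPdf, tempPs, outputPdf))
        {
            var psi = new ProcessStartInfo(exe) { UseShellExecute = false };
            foreach (var arg in args)
                psi.ArgumentList.Add(arg); // ArgumentList handles quoting of paths with spaces

            using var process = Process.Start(psi)
                ?? throw new InvalidOperationException("Could not start " + exe);
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(exe + " exited with code " + process.ExitCode);
        }
    }
}
```

A call such as PdfReprint.Run("in.pdf", "temp.ps", "in-fixed.pdf") would then produce a reprinted PDF, which will most likely still contain the overset text.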
Once the fonts are resolved you could try again with iText. I simply ran Poppler's pdftotext instead, so as to remove the overset. Use the latest 64-bit version of Poppler. Windows 32-bit users and others may need to use the Xpdf command-line tools: currently v4.05 (https://www.xpdfreader.com/download.html). The example below uses Poppler and targets page 1.
NOTE: With this particular file, Poppler performed noticeably better than Xpdf. However, your results may vary depending on language, input, platform and settings!
pdftotext -layout -nopgbrk -enc UTF-8 -fixed 3.1 "in-fixed.pdf" "out.txt"
Result
Tarnija: Rolling OÜ
Registrikood / KMKR nr.: 12571469 / EE101695107
Aadress: Betooni tn 1, Tallinn, 11415
Kaupade väljastamiskoht: Betooni tn 1, Tallinn, 11415
Panga nimetus: SEB PANK AS, EEUHEE2X
Arve: 2403793 EST 09.04.2024
Panga konto: EE131010220225504229
Kontaktinformatsioon: Telefon: 6 280 808; Faks: 6 280 809
Maksekuupäev: 09.05.2024 Teie kontaktisik: Angelina Burtseva
Vedaja: ROLLING OÜ Saaja: DGM SHIPPING, AS
Registreerimisnumber: 12571469
Sõiduki juhi eesnimi, perekonnanimi: Kliendi number 71771
Sõiduki registreerimisnumber: Registrikood / KMKR nr.: 10061617 / EE100412077
Tehingu kirjeldus: kaupade tarnimine Aadress: A. Puškini tn 9-6, Narva, 20309
Piirkond: NEW - EAST TALLINN Kaupade kättesaamiskoht: Nurmevälja tn 1 Maardu Harjumaa 74114
Panga nimetus:
Panga konto:
Makseviis: 30 days Maksja: DGM SHIPPING, AS
Kommentaar:
Kaubad on tarnija omand seni, kuni nende eest on täielikult tasutud
Kogus/ Hind Summa
Artikkel Nimetus %
Ühik ilma KM ilma KM
Saateleht 2405326 [STOCK] (EAST)
Nurmevälja tn 1 Maardu Harjumaa 74114
17-02-0206 Paberrätik Tork Basic M2, 280 m, 1 kiht, valge 6 rulli, 140002 1 IEPAK 30.05 4.1 28.82
17-26-0269 Tänavahari Saima Broom, 25 x 120 cm, must/punane 1 GAB 9.16 4.9 8.71
08-05-0005 Käärid Office Point, 15.5 cm, Soft-Grip 1 TÜKK 1.35 17.5 1.11
04-07-0238 Ilmastikukindlad kleebisetiketid Herma 99.1 x 93.1 mm, 25 lehte, valged, väga tugeva kinnitusega 1 IEPAK 30.85 15.7 26.01
17-01-0128 Tualettpaber Jumbo Clean, 12 rulli, 2 kihti, valge 1 IEPAK 23.80 4.5 22.73
Saateleht 2404987 [STOCK] (EAST),DMM, Viidud, 03.04.2024, 13:46
Nurmevälja tn 1 Maardu Harjumaa 74114
04-02-0225 Märkmepaber Forofis, 51x76mm,100p., kollane 3 GAB 0.46 20.4 1.10
...
continued down to
...
05-01-0010 Ümbrikud iseliimuva kleepribaga, C5, 25 tk. Valge 2 PK 1.37 17.0 2.27
05-01-0008 Ümbrikud iseliimuva kleepribaga, C6, 25 tk. Valge 4 PK 0.87 17.1 2.88
01-05-0197 Sissetõmmatav geelist tindipliiats ErichKrause Smart-Gel, 0,5 mm, tindi värv: sinine 12GAB 0.60 18.4 5.88
The discussion in the comments showed that only a small section of pages 1 and 4 was wanted, and that can be extracted directly from the poor-quality source with Poppler's pdftotext.
For example, to extract lines by a word or two (or by an area), we can use:
pdftotext -f 1 -l 1 -layout -enc UTF-8 "rolling garbage Arve_24037931.pdf" - 2>nul |Findstr "Rolling EST KMKR Makseku" >temp.txt
So far that gives the following from page 1. Note that the query cannot include accented UTF characters, but the result can!
Tarnija: Rolling OÜ
Registrikood / KMKR nr.: 12571469 / EE101695107
Arve: 2403793 EST 09.04.2024
Maksekuupäev: 09.05.2024 Teie kontaktisik: Angelina Burtseva
Sõiduki registreerimisnumber: Registrikood / KMKR nr.: 10061617 / EE100412077
Similarly, we can append data from the bottom of page 4:
pdftotext -f 4 -l 4 -layout -enc UTF-8 "rolling garbage Arve_24037931.pdf" - 2>nul |Findstr "Summa 22% Kokku" >>temp.txt
So now we have the core data:
Tarnija: Rolling OÜ
Registrikood / KMKR nr.: 12571469 / EE101695107
Arve: 2403793 EST 09.04.2024
Maksekuupäev: 09.05.2024 Teie kontaktisik: Angelina Burtseva
Sõiduki registreerimisnumber: Registrikood / KMKR nr.: 10061617 / EE100412077
Kogus/ Hind Summa
Summa ilma käibemaksutaEUR 370.37
*2403793r* 22%EUR 81.48
KokkuEUR 451.85
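From C#, roughly the same pdftotext plus Findstr pipeline could be sketched as below. This is only a sketch under some assumptions: pdftotext is on the PATH, the class and method names (PdfLineGrep, ExtractPage, FilterLines) are made up for illustration, and the filter mimics Findstr's default behaviour of keeping any line that contains at least one of the search words, case-sensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

public static class PdfLineGrep
{
    // Keeps every line that contains at least one keyword (case-sensitive),
    // similar to Findstr "word1 word2".
    public static List<string> FilterLines(string text, params string[] keywords) =>
        text.Split('\n')
            .Select(line => line.TrimEnd('\r'))
            .Where(line => keywords.Any(k => line.Contains(k, StringComparison.Ordinal)))
            .ToList();

    // Runs: pdftotext -f N -l N -layout -enc UTF-8 <pdf> -
    // and returns whatever pdftotext writes to standard output.
    public static string ExtractPage(string pdfPath, int page)
    {
        var psi = new ProcessStartInfo("pdftotext")
        {
            RedirectStandardOutput = true,
            StandardOutputEncoding = Encoding.UTF8,
            UseShellExecute = false,
        };
        string n = page.ToString();
        foreach (var arg in new[] { "-f", n, "-l", n, "-layout", "-enc", "UTF-8", pdfPath, "-" })
            psi.ArgumentList.Add(arg);

        using var process = Process.Start(psi)
            ?? throw new InvalidOperationException("Could not start pdftotext");
        string output = process.StandardOutput.ReadToEnd(); // read before waiting to avoid a full pipe
        process.WaitForExit();
        return output;
    }
}
```

For page 1 that might be called as PdfLineGrep.FilterLines(PdfLineGrep.ExtractPage("invoice.pdf", 1), "Rolling", "EST", "KMKR", "Makseku"), where invoice.pdf is a placeholder file name. Because the keywords are ordinary .NET strings, accented search words should also work here.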