I am filtering a substring out of a string.
Now I want to print the number that comes after the substring; in this case that number is at index 475.
This is the whole string:
Koopliedenweg 38\nDeb. nr. : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum : 19-11-21\nAantal Omschrijving Prijs Bedrag\nOrder number : 76372 Loading date : 15-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date :\nWK46\nVerdi Import Schoolfruit\n566 Ananas Crownless 14kg 10 Sweet CR Klasse I € 7,00 € 3.962,00\n706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08\n598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40\nOrder number : 76462 Loading date : 18-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date
And then I have this substring:
Appels Royal Gala 13kg 60/65 Generica PL Klasse I
From this I want to print the number 3.304,08, which is at index 475.
How can I print this number?
This is my code fragment:
import io
import re

import pytesseract
from PIL import Image
from wand.image import Image as wi  # assumed: wi is Wand's Image class

# Render every page of the PDF to a JPEG blob.
pdfFile = wi(
    filename="C:\\Users\\engel\\Documents\\python\\docs\\fixedPDF.pdf", resolution=300)
image = pdfFile.convert('jpeg')
imageBlobs = []
for img in image.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('jpeg'))

text_factuur_verdi = []
appels_royal_gala = 'Appels Royal Gala 13kg 60/65 Generica PL Klasse I'
ananas_crownless = 'Ananas Crownless 14kg 10 Sweet CR Klasse I'
peen_waspeen = 'Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I'

# OCR each page and look for the substring in the recognised text.
for imgBlob in imageBlobs:
    image = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(image, lang='eng')
    text_factuur_verdi.append(text)
    allSubstring = re.search(appels_royal_gala, text)
    #indexEndAllsubstring = ' '.join(allSubstring)
    if allSubstring is not None:  # re.search returns None when nothing matches
        print(allSubstring.end() + 11)
Or maybe there is a better way to do this?
In any case, it is the second number after the substring:
Appels Royal Gala 13kg 60/65 Generica PL Klasse I
€ 4,68 € 3.304,08
text = "Koopliedenweg 38\nDeb. nr. : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum : 19-11-21\nAantal Omschrijving Prijs Bedrag\nOrder number : 76372 Loading date : 15-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date :\nWK46\nVerdi Import Schoolfruit\n566 Ananas Crownless 14kg 10 Sweet CR Klasse I € 7,00 € 3.962,00\n706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08\n598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40\nOrder number : 76462 Loading date : 18-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date"
appels_royal_gala = 'Appels Royal Gala 13kg 60/65 Generica PL Klasse I'
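import re

def make_pattern(substr):
    # Look behind for the substring, lazily skip ahead, and capture
    # the number that sits directly before the newline.
    return r"(?<=" + substr + r").*?(?P<number>[0-9,.]*)\n"

allSubstring = re.findall(make_pattern(appels_royal_gala), text)
print(allSubstring[0])
# Prints: 3.304,08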
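If you care about the index, you can still use re.search. The result is then a match object, so you should print allSubstring[1] (instead of allSubstring[0]), and allSubstring.start('number') gives the index of the number.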
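As a rough sketch of the re.search variant, assuming a shortened three-line sample of the invoice text (the sample and the name number_index are only for illustration), the match object gives both the number and its position. This sketch also wraps the substring in re.escape, which is not in the pattern above but guards against regex metacharacters in a product description:

```python
import re

# Shortened sample of the OCR text (assumption: same line layout as the invoice).
text = (
    "566 Ananas Crownless 14kg 10 Sweet CR Klasse I € 7,00 € 3.962,00\n"
    "706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08\n"
    "598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40\n"
)
appels_royal_gala = "Appels Royal Gala 13kg 60/65 Generica PL Klasse I"


def make_pattern(substr):
    # re.escape keeps characters such as '.' or '+' in substr literal.
    return r"(?<=" + re.escape(substr) + r").*?(?P<number>[0-9,.]*)\n"


match = re.search(make_pattern(appels_royal_gala), text)
if match is not None:
    number = match.group("number")        # same value as match[1]
    number_index = match.start("number")  # position of the number inside text
    print(number, number_index)
```

The index refers to the sample string here, so it will differ from the index in the full invoice text.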
This solution assumes the number you're looking for is always followed by \n, which seems to be constant in your example.
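If the trailing \n cannot be relied on (for example when the line is the last one in the OCR output), one possible alternative is to anchor on the end of the line instead. The sketch below is only an illustration: the helper name last_amount_after and the AMOUNT pattern are hypothetical, and it assumes amounts are written in Dutch notation (3.304,08) and that the wanted amount is the last thing on the line. It also converts the result to a Decimal so it can be used in calculations:

```python
import re
from decimal import Decimal

# Assumed amount format: optional thousands groups with '.', two decimals after ','.
AMOUNT = r"\d{1,3}(?:\.\d{3})*,\d{2}"  # e.g. 4,68 or 3.304,08


def last_amount_after(substr, text):
    """Return the last amount on the line containing substr, or None."""
    # [^\n]*? keeps the match on one line; with re.MULTILINE, $ matches
    # before a newline and also at the very end of the text.
    pattern = re.escape(substr) + r"[^\n]*?(" + AMOUNT + r")[ \t]*$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    if match is None:
        return None
    # Dutch notation: '.' groups thousands, ',' is the decimal separator.
    return Decimal(match.group(1).replace(".", "").replace(",", "."))


line = "706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08"
print(last_amount_after("Appels Royal Gala 13kg 60/65 Generica PL Klasse I", line))
```

Because OCR output can contain misread characters, the None branch is worth handling in real use.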