
How to print the number after the end() method?


I match a substring in a string.

Now I want to print the number that comes after the substring. In this case the number is at index 475, and I want to print the number itself.

This is the whole string:

Koopliedenweg 38\nDeb. nr. : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum : 19-11-21\nAantal Omschrijving Prijs Bedrag\nOrder number : 76372 Loading date : 15-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date :\nWK46\nVerdi Import Schoolfruit\n566 Ananas Crownless 14kg 10 Sweet CR Klasse I € 7,00 € 3.962,00\n706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08\n598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40\nOrder number : 76462 Loading date : 18-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date

And this is the substring:

Appels Royal Gala 13kg 60/65 Generica PL Klasse I

I want to print the number 3.304,08, which is at index 475.

How can I print this number?

This is my code fragment:

# Imports assumed from the calls below; the original post does not show them.
import io
import re

import pytesseract
from PIL import Image
from wand.image import Image as wi

# Render every page of the PDF as a JPEG image.
pdfFile = wi(
    filename="C:\\Users\\engel\\Documents\\python\\docs\\fixedPDF.pdf", resolution=300)
image = pdfFile.convert('jpeg')

imageBlobs = []

for img in image.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('jpeg'))

text_factuur_verdi = []
appels_royal_gala = 'Appels Royal Gala 13kg 60/65 Generica PL Klasse I'
ananas_crownless = 'Ananas Crownless 14kg 10 Sweet CR Klasse I'
peen_waspeen = 'Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I'

# OCR each page and look for the substring in the recognized text.
for imgBlob in imageBlobs:
    image = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(image, lang='eng')
    text_factuur_verdi.append(text)
    allSubstring = re.search(appels_royal_gala, text)
    #indexEndAllsubstring = ' '.join(allSubstring)

    # This prints an index (end of the match plus 11), not the number itself.
    print(allSubstring.end() + 11)

Or is there a better way to do this?

In any case, it is the second number after the substring:

Appels Royal Gala 13kg 60/65 Generica PL Klasse I
€ 4,68 € 3.304,08

Solution

    import re

    text = "Koopliedenweg 38\nDeb. nr. : 108636 2991 LN BARENDRECHT\nYour VAT nr. : NL851703884B01 Nederland\nFactuur datum : 19-11-21\nAantal Omschrijving Prijs Bedrag\nOrder number : 76372 Loading date : 15-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date :\nWK46\nVerdi Import Schoolfruit\n566 Ananas Crownless 14kg 10 Sweet CR Klasse I € 7,00 € 3.962,00\n706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08\n598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40\nOrder number : 76462 Loading date : 18-11-21 Incoterm: : FOT\nYour ref. : SCHOOLFRUIT Delivery date"

    appels_royal_gala = 'Appels Royal Gala 13kg 60/65 Generica PL Klasse I'

    def make_pattern(substr):
        # The lookbehind anchors the match directly after the substring;
        # re.escape keeps any regex metacharacters in it literal.
        # The lazy .*? then skips ahead to the last number before the newline.
        return r"(?<=" + re.escape(substr) + r").*?(?P<number>[0-9,.]*)\n"

    # With a single capturing group, findall returns that group's contents.
    allSubstring = re.findall(make_pattern(appels_royal_gala), text)
    print(allSubstring[0])

    # Prints:
    # 3.304,08
    

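    If you also need the index, use re.search instead of re.findall. It returns a match object rather than a list, so print(allSubstring[1]) prints the captured number (group 1, instead of list element 0), and allSubstring.start(1) gives the index where that number begins.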
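
A minimal sketch of the re.search variant, assuming the same invoice lines as in the question. The helper name find_amount is hypothetical (it is not part of the original answer), and the character class uses + instead of * so that an empty match is never returned.

```python
import re

# Shortened sample of the OCR text from the question.
text = (
    "566 Ananas Crownless 14kg 10 Sweet CR Klasse I € 7,00 € 3.962,00\n"
    "706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08\n"
    "598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40\n"
)
appels_royal_gala = "Appels Royal Gala 13kg 60/65 Generica PL Klasse I"


def find_amount(substr, text):
    """Hypothetical helper: return (number, start_index) or None.

    The number is the last run of digits, dots and commas before the
    newline on the line that contains substr.
    """
    pattern = r"(?<=" + re.escape(substr) + r").*?(?P<number>[0-9,.]+)\n"
    match = re.search(pattern, text)
    if match is None:
        return None
    # group() gives the matched text, start() gives its index in text.
    return match.group("number"), match.start("number")


result = find_amount(appels_royal_gala, text)
if result is not None:
    number, index = result
    print(number, index)
```

Returning None when the substring is missing avoids the AttributeError that calling .end() or .start() on a failed search would raise, which can happen when one OCR page does not contain the product line.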

    This solution assumes the number you're looking for is always followed by \n, which holds for every product line in your example. A product line at the very end of the text, without a trailing \n, would not match.
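
If the trailing \n cannot be relied on, one possible variant anchors on the end of the line with $ and re.MULTILINE instead. The sketch below also converts the result to a Decimal, assuming the invoice uses Dutch number formatting (dot as thousands separator, comma as decimal separator). The helper name amount_at_line_end is hypothetical.

```python
import re
from decimal import Decimal


def amount_at_line_end(substr, text):
    """Hypothetical helper: last number on the line containing substr.

    Returns a Decimal, or None when substr is not followed by a number
    on the same line.
    """
    pattern = (
        r"(?<=" + re.escape(substr) + r")"
        r".*?(?P<number>[0-9][0-9.,]*)[ \t]*$"
    )
    # With re.MULTILINE, $ matches before each newline and at end of text.
    match = re.search(pattern, text, flags=re.MULTILINE)
    if match is None:
        return None
    raw = match.group("number")
    # Assumption: "3.304,08" means 3304.08 (Dutch formatting).
    return Decimal(raw.replace(".", "").replace(",", "."))


# The last line has no trailing newline on purpose.
sample = (
    "706 Appels Royal Gala 13kg 60/65 Generica PL Klasse I € 4,68 € 3.304,08\n"
    "598 Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I € 6,30 3.767,40"
)
appels = amount_at_line_end(
    "Appels Royal Gala 13kg 60/65 Generica PL Klasse I", sample)
peen = amount_at_line_end(
    "Peen Waspeen 14x1lkg 200-400 Generica BE Klasse I", sample)
print(appels, peen)
```

Decimal is used rather than float so that invoice amounts keep their exact two-decimal value when they are summed or compared later.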