Tags: python, regex, pdf, text-extraction, pdfplumber

How to do complex PDF extraction with regex


I have a PDF file that contains lottery ticket winners, and I want to extract all winning tickets grouped by prize.

PDF file

I tried this:

import re
import pdfplumber

prize_re = re.compile(r"^\d[a-z]")
cons_prize_re = re.compile(r"^Cons")
ticket1_line_re = re.compile(r"^\d[)]")
ticket2_line_re = re.compile(r"^\d{4}")
ticket3_line_re = re.compile(r"[A-Z] \d{6}")

with pdfplumber.open("./test11.pdf") as pdf:
    for i in range(len(pdf.pages)):
        page_text = pdf.pages[i].extract_text()

        for line in page_text.split("\n"):
            if prize_re.match(line) or cons_prize_re.match(line) or ticket1_line_re.match(line) or ticket2_line_re.match(line) or ticket3_line_re.search(line):
                print(line)

This is what I got. I don't know how to assign each ticket to its prize. Also, the Cons prize ticket numbers look a bit strange and I don't know why (AN 867952AO 867952AP should be AN 867952 AO 867952 AP...):

1st Prize Rs :7000000/- 1) AU 867952 (MANANTHAVADY)
Cons Prize-Rs :8000/- AN 867952AO 867952AP 867952 AR 867952AS 867952
AT 867952 AV 867952 AW 867952AX 867952AY 867952
AZ 867952
2nd Prize Rs :500000/- 1) AZ 499603 (ADOOR)
3rd Prize Rs :100000/- 1) AN 215264 (KOTTAYAM)
2) AO 852774 (PATTAMBI)
3) AP 953655 (KOTTAYAM)
4) AR 638904 (PAYYANUR)
5) AS 496774 (VAIKKOM)
6) AT 878990 (WAYANADU)
7) AU 703702 (PUNALUR)
8) AV 418446 (WAYANADU)
9) AW 994685 (KOZHIKKODE)
10) AX 317550 (PATTAMBI)
11) AY 854780 (CHITTUR)
12) AZ 899905 (KARUNAGAPALLY
...

Instead, I want to get:

 [
    {
        "1st Prize Rs :7000000",
        "tickets": [
            "AU 867952"
        ]
     },
    {
        "Cons Prize-Rs :8000",
        "tickets": [
            "AN 867952",
            "AO 867952",
            "AP 867952",
            "AR 867952",
            ...
        ]
     },
     ...
 ]

How can I achieve this?


Solution

  • You could first match each complete prize section, across all pages, using capture groups.

    Then you can post-process the 3rd capture group to get the separate "tickets", and build the wanted data structure in a loop.

    For the first step, you can use a pattern that matches the start of every prize section and captures everything up to the next prize section.

    ^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)
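
To see what the three capture groups hold, here is a minimal sketch that runs the pattern on a few lines copied from the extracted text shown in the question. The names `sample`, `section_re` and `sections` are only illustrative, and the real PDF text may contain layouts this sample does not cover.

```python
import re

# Sample lines copied from the extracted text shown in the question.
sample = """1st Prize Rs :7000000/- 1) AU 867952 (MANANTHAVADY)
Cons Prize-Rs :8000/- AN 867952AO 867952AP 867952 AR 867952AS 867952
AT 867952 AV 867952 AW 867952AX 867952AY 867952
AZ 867952
2nd Prize Rs :500000/- 1) AZ 499603 (ADOOR)
3rd Prize Rs :100000/- 1) AN 215264 (KOTTAYAM)
2) AO 852774 (PATTAMBI)
3) AP 953655 (KOTTAYAM)"""

# Group 1: prize label, group 2: amount, group 3: everything up to the next
# line that starts a new prize section (enforced by the negative lookahead).
section_re = re.compile(
    r"^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)",
    re.MULTILINE,
)

sections = [m.groups() for m in section_re.finditer(sample)]
for label, amount, body in sections:
    print(repr(label), amount, repr(body))
```

Group 3 of the Cons section spans three lines, because the lookahead only stops the match at a line that begins with a new prize label.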
    


    For the post-processing, you can use a pattern for the ticket formats that matches either 2 uppercase chars, a space and 6 digits (not followed by another digit), or 4 or more digits with a whitespace boundary on both sides.

    (?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))
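
As a quick check, this sketch applies the ticket pattern to the Cons line from the question, where some tickets are glued together without a space. The second input, `digits_line`, is a made-up example of digit-only numbers and is not taken from the PDF.

```python
import re

ticket_re = re.compile(r"(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))")

# Line copied from the question: "867952AO" has no space between tickets.
cons_line = "AN 867952AO 867952AP 867952 AR 867952AS 867952"
cons_tickets = ticket_re.findall(cons_line)
print(cons_tickets)
# ['AN 867952', 'AO 867952', 'AP 867952', 'AR 867952', 'AS 867952']

# Hypothetical digit-only numbers: "123" is too short for the second alternative.
digits_line = "0089 0425 123"
digit_tickets = ticket_re.findall(digits_line)
print(digit_tickets)
# ['0089', '0425']
```

The first alternative needs no boundary before the two letters, which is why the missing spaces in the extracted text do not matter.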
    


    Example code using the pdf file from the question:

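    import json
    import re

    import pdfplumber

    # One prize section: label (group 1), amount (group 2), and all text
    # up to the next prize section (group 3).
    pattern = r"^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)"
    # One ticket: 2 uppercase chars, space, 6 digits, or a standalone run of 4+ digits.
    ticket_pattern = r"(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))"

    with pdfplumber.open("./test11.pdf") as pdf:
        # Join all pages so that a section can continue across a page break.
        # "or ''" guards against pages for which extract_text() yields no text.
        all_text = "\n".join(page.extract_text() or "" for page in pdf.pages)

    coll = []
    for match in re.finditer(pattern, all_text, re.MULTILINE):
        dct = {match.group(1): match.group(2)}
        dct["tickets"] = re.findall(ticket_pattern, match.group(3))
        coll.append(dct)

    print(json.dumps(coll, indent=4))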
    

    Output

    [
        {
            "1st Prize Rs ": "120000000",
            "tickets": [
                "XG 218582"
            ]
        },
        {
            "Cons Prize-Rs ": "500000",
            "tickets": [
                "XA 218582",
                "XB 218582",
                "XC 218582",
                "XD 218582",
                "XE 218582"
            ]
        },
        {
            "2nd Prize Rs ": "5000000",
            "tickets": [
                "XA 788417",
                "XB 161796",
                "XC 319503",
                "XD 713832",
                "XE 667708",
                "XG 137764"
            ]
        },
        ....
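
The structure asked for in the question mixes a bare string with a "tickets" key, which is not valid JSON, so the output above uses the prize label as a key instead. If fixed key names are easier to work with, a variation could look like the sketch below. The function name `parse_prizes` and the keys "prize" and "amount" are assumptions made for illustration, and the sample text is copied from the question instead of being read from the PDF.

```python
import json
import re

SECTION_RE = re.compile(
    r"^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)",
    re.MULTILINE,
)
TICKET_RE = re.compile(r"(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))")


def parse_prizes(text):
    """Return one dict per prize section found in the extracted text."""
    return [
        {
            "prize": m.group(1).strip(),   # e.g. "1st Prize Rs"
            "amount": int(m.group(2)),     # e.g. 7000000
            "tickets": TICKET_RE.findall(m.group(3)),
        }
        for m in SECTION_RE.finditer(text)
    ]


# Sample lines copied from the extracted text shown in the question.
sample = """1st Prize Rs :7000000/- 1) AU 867952 (MANANTHAVADY)
Cons Prize-Rs :8000/- AN 867952AO 867952AP 867952 AR 867952AS 867952
AT 867952 AV 867952 AW 867952AX 867952AY 867952
AZ 867952
2nd Prize Rs :500000/- 1) AZ 499603 (ADOOR)"""

result = parse_prizes(sample)
print(json.dumps(result, indent=4))
```

With pdfplumber, the same function could be called on the joined page text instead of `sample`.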