python-3.x regex python-requests ocr regex-group

Find text matching a pattern in a Google Lens response using regex


I am trying to extract a learner's license number from Google Lens by uploading an image, but my regex is not working. The license numbers appear in the following patterns:

KL 14 /0000007/2023

KL14 /0000007/2023

KL 14/0000007/2023

KL 14 /0000007/ 2023

and so on, which means there may or may not be spaces between the parts.

My regex is KL [0-9]{1}./.[0-9]{1}./.[0-9]{1}. but it is not working.

My code:

`from lxml.html import soupparser
import re
import os
import requests

folder_dir = os.getcwd()
for images in os.listdir(folder_dir):
    try:
        # check if the image ends with png or jpg or jpeg
        if (images.endswith(".png") or images.endswith(".jpg") \
                or images.endswith(".jpeg")):

            proxy = '127.0.0.1:8080'
            os.environ['http_proxy'] = proxy
            os.environ['HTTP_PROXY'] = proxy
            os.environ['https_proxy'] = proxy
            os.environ['HTTPS_PROXY'] = proxy
            os.environ['REQUESTS_CA_BUNDLE'] = "C:\\Users\\User\\Desktop\\cacert.pem"

            print("-------------------------------------------------------------------------------------")
            print(images)
            print("\n")
            captchaurl = 'https://lens.google.com/upload?ep=ccm&s=csp&st=1653142987619'
            encoded_image = {'encoded_image': open(images, 'rb')}
            burp0cap_headers = {"Cache-Control": "max-age=0", "Upgrade-Insecure-Requests": "1",
                                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36",
                                "Origin": "null",
                                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                                "Sec-Gpc": "1", "Sec-Fetch-Site": "none",
                                "Sec-Fetch-Mode": "navigate", "Sec-Fetch-User": "?1",
                                "Sec-Fetch-Dest": "document", "Accept-Encoding": "gzip, deflate",
                                "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"}
            rlens = requests.post(captchaurl, files=encoded_image, headers=burp0cap_headers,
                                  allow_redirects=True)
            DATA000 = str(rlens.content)
            # print(DATA000)
            root = soupparser.fromstring(DATA000)
            result_url = root.xpath('//meta[@http-equiv="refresh"]/@content')
            result_url = str(result_url[0])
            url2 = result_url.split('URL=')
            finalurl = str(url2[1])
            # print(finalurl)
            burp1cap_headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36",
                "Accept-Encoding": "gzip, deflate",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                "Cache-Control": "max-age=0", "Upgrade-Insecure-Requests": "1", "Origin": "null",
                "Sec-Gpc": "1", "Sec-Fetch-Site": "none", "Sec-Fetch-Mode": "navigate",
                "Sec-Fetch-User": "?1", "Sec-Fetch-Dest": "document",
                "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"}
            r2 = requests.get(finalurl, headers=burp1cap_headers)
            r3 = str(r2.text)

            r4 = r3.replace('"', '')
            # print(r4)

            phoneNumRegex2 = re.compile(r'KL *[0-9]{1}.*\/.*[0-9]{1}.*\/.*[0-9]+')

            mo = phoneNumRegex2.search(str(r4))
            print(mo.group())

    except Exception as e:
        print(e)`

The response from Google Lens is:

"text:0:e90nKYDCi5I\u003d"],6,[]]],[[],3]]]],[]],[[],null,null,"en",[[["FORM 3 [See Rule 3(a) a","LEARNER'S L","Application No... 394442223","Learner's Licence","KL 14 /0002707/2023","Issue Date.....","1. Name","SATHEESAN U","2. Father's Name","CHOUKAR K","Date of Birth","07-03-1984"]],"Ad7f3FjZKr2A8ovUoig+fwJqhVKxG6sbvcjciTQV+KzOBTZf2VGydPYtpIkEMPU6sQyWL+Ad8/Vjl0/OV0izP/oXCluFA2xNbzAktl3KxaOVnfyvyS3kTwHv",[1678139279,21105500

The response contains something like the above. I need to extract the learner's license number from it.

The output is None.

I will provide sample images as attachments.


Solution

  • This regex allows optional whitespace between any of the elements:

    KL\s*\d+\s*/\s*\d+\s*/\s*\d+
    

    \s* means zero or more whitespace characters. Each run of digits is then matched with \d+, which means one or more digits. Your regex used [0-9]{1}, which matches exactly one digit.

    Regex101 playground/explanation
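
As a rough illustration, here is a minimal sketch of how the pattern above could be applied with Python's re module. The helper name find_license and the step that strips whitespace from the match are assumptions added for illustration, not part of the original answer. The sample strings are the four variants from the question plus a shortened piece of the quoted response.

```python
import re

# Optional whitespace (\s*) around each slash; \d+ takes a whole run of digits.
LICENSE_RE = re.compile(r"KL\s*\d+\s*/\s*\d+\s*/\s*\d+")


def find_license(text):
    """Hypothetical helper: return the first license number found in text,
    with all whitespace removed, or None when nothing matches."""
    match = LICENSE_RE.search(text)
    if match is None:
        # Checking for None avoids an AttributeError from calling .group()
        return None
    # Assumption: collapsing the spacing variants into one form is desirable.
    return re.sub(r"\s+", "", match.group())


# The four spacing variants listed in the question.
samples = [
    "KL 14 /0000007/2023",
    "KL14 /0000007/2023",
    "KL 14/0000007/2023",
    "KL 14 /0000007/ 2023",
]
for sample in samples:
    print(find_license(sample))  # KL14/0000007/2023 for every variant

# A shortened piece of the response quoted in the question.
response = '"Learner\'s Licence","KL 14 /0002707/2023","Issue Date....."'
print(find_license(response))  # KL14/0002707/2023
print(find_license("no license number here"))  # None
```

Testing the result of search() before calling group() keeps a missing match from raising an exception, so a page without a license number can be reported explicitly.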