
QR code detection in Python with OpenCV raises UnicodeDecodeError: 'utf-8' codec can't decode byte


I have written a class to retrieve creditor data, such as the IBAN, from an invoice PDF with a QR code on it. It worked fine until we got a PDF that throws this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 157: invalid start byte

This happens when I try to process the PDF.

This is how I've done it:

doc = fitz.open(self.image_path)  # open document
i = 0

if not os.path.exists(f"./qr_codes/"):
    os.makedirs(f"./qr_codes/")

for page in doc:
    pix = page.get_pixmap(matrix=self.mat)  # render page to an image
    pix.save(f"./qr_codes/page_{i}.png")

    img = cv2.imread(f"./qr_codes/page_{i}.png")
    detect = cv2.QRCodeDetector()
    text, points, straight_qrcode = detect.detectAndDecode(img)

    if text:
        # try to find a IBAN in one of the lines
        self.iban = "\r\n".join([line for line in text.splitlines() if re.findall(r"CH\d{19}", line.strip())])
        # try to find the reference number by joining all lines and searching for CH QRR \d+
        # Also replace the CH QRR Stuff, because only the number is needed for SAP
        ref_number = re.findall(r'CH\s*QRR\s*\d+|$', " ".join(text.splitlines()))
        self.ref_number = int(re.sub(r"\D","", ref_number[0])) if ref_number else None
        self.__save_values()
        return True
    i += 1
return False

Is there a way to strip the bytes somehow?

I've also tried it via a NumPy array:

    stream = open(f'./qr_codes/page_{i}.png', encoding="utf-8", errors="ignore")
    stream = bytearray(stream.read(), encoding="utf-8")
    detect = cv2.QRCodeDetector()
    text, points, straight_qrcode = detect.detectAndDecode(numpy.asarray(stream, dtype=numpy.uint8))
    # print(text)

But this way I only get an empty text, so I guess I'm doing something wrong here. Could someone suggest how to solve the byte issue?

Edit: As requested, the full traceback:

Traceback (most recent call last):
  File "C:\Users\m7073\Repos\Chronos_New\invoice_extraction\qr_code_scan.py", line 128, in <module>
    qrcode.set_qr_values()
  File "C:\Users\m7073\Repos\Chronos_New\invoice_extraction\qr_code_scan.py", line 73, in set_qr_values
    text, points, straight_qrcode = detect.detectAndDecode(img)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 157: invalid start byte

Edit 2 (minimal reproducible example):

import cv2
img = cv2.imread(f"page_1.png")
detect = cv2.QRCodeDetector()
text, points, straight_qrcode = detect.detectAndDecode(img)


Solution

  • Late reply because I was ill for 3 weeks, but I've switched to zxingcpp and adjusted the code from:

    for page in doc:
        pix = page.get_pixmap(matrix=self.mat)  # render page to an image
        pix.save(f"./qr_codes/page_{i}.png")
    
        img = cv2.imread(f"./qr_codes/page_{i}.png")
        detect = cv2.QRCodeDetector()
        text, points, straight_qrcode = detect.detectAndDecode(img)
    
        if text:
            # try to find a IBAN in one of the lines
            self.iban = "\r\n".join([line for line in text.splitlines() if re.findall(r"CH\d{19}", line.strip())])
            # try to find the reference number by joining all lines and searching for CH QRR \d+
            # Also replace the CH QRR Stuff, because only the number is needed for SAP
            ref_number = re.findall(r'CH\s*QRR\s*\d+|$', " ".join(text.splitlines()))
            self.ref_number = int(re.sub(r"\D","", ref_number[0])) if ref_number else None
            self.__save_values()
            return True
        i += 1
    

    to:

    for page in pdf:
        pil_image = page.render(scale=3).to_pil()
        pil_image.save(f"./qr_codes/page_{i}.png")
        img = cv2.imread(f"./qr_codes/page_{i}.png")
    
        for result in zxingcpp.read_barcodes(img):
            # try to find a IBAN in one of the lines
            self.iban = "\r\n".join([line for line in result.text.splitlines() if re.findall(r"CH\d{19}", line.strip())])
            # try to find the reference number by joining all lines and searching for CH QRR \d+
            # Also replace the CH QRR Stuff, because only the number is needed for SAP
            ref_number = re.findall(r'CH\s*QRR\s*\d+|$', " ".join(result.text.splitlines()))
            self.ref_number = int(re.sub(r"\D", "", ref_number[0])) if ref_number else None
            self.__save_values()
            return True
        i += 1
    

    And this way it works.
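
A possible explanation for the original error, offered as an assumption rather than something confirmed in this thread: byte 0xfc is "ü" in ISO-8859-1 (Latin-1), so the QR payload was probably written in Latin-1 and fails when its raw bytes are decoded as UTF-8. The sketch below reproduces that mismatch in plain Python, without OpenCV; the sample creditor name is made up.

```python
# Hypothetical payload fragment: a name containing "ü", encoded as Latin-1.
raw = "Müller AG".encode("latin-1")  # b'M\xfcller AG'

# Decoding those bytes as UTF-8 fails, because 0xfc is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
    error_message = ""
except UnicodeDecodeError as exc:
    utf8_ok = False
    error_message = str(exc)

# Decoding with the encoding the bytes were written in recovers the text.
latin1_text = raw.decode("latin-1")

print(utf8_ok)
print(error_message)
print(latin1_text)
```

If this assumption holds, the problem lies in how the decoded bytes are converted to a Python string, not in the image itself.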

    Simple example to get it going:

    import cv2
    import zxingcpp

    img = cv2.imread("./qr_codes/page_1.png")
    # iterate over all barcodes (including QR codes) found in the image
    for result in zxingcpp.read_barcodes(img):
        print(result.text)
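
The IBAN and reference extraction used in the loop above can also be pulled into a small pure function, so it can be tested without any image. This is only a sketch: the function name `extract_iban_and_reference` and the sample payload are made up for illustration. Unlike the original, it uses a capture group and returns None when no reference is found, because the original pattern with `|$` always yields at least an empty match, so `int("")` could raise a ValueError.

```python
import re

# Same patterns as in the code above; the capture group isolates the digits.
IBAN_PATTERN = re.compile(r"CH\d{19}")
REFERENCE_PATTERN = re.compile(r"CH\s*QRR\s*(\d+)")


def extract_iban_and_reference(text):
    """Return (iban, ref_number) from decoded QR text (hypothetical helper)."""
    lines = text.splitlines()
    # Keep every line that contains something shaped like a Swiss IBAN.
    iban = "\r\n".join(line for line in lines if IBAN_PATTERN.search(line.strip()))
    # Join all lines so "CH", "QRR" and the number may sit on separate lines.
    match = REFERENCE_PATTERN.search(" ".join(lines))
    ref_number = int(match.group(1)) if match else None
    return iban, ref_number


# Made-up sample payload, not taken from a real invoice.
sample = "CH4431999123000889012\nCH\nQRR\n210000000003139471430009017"
print(extract_iban_and_reference(sample))
```

Inside the loop, this could be called as `extract_iban_and_reference(result.text)`.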
    

    Edit: I had trouble getting the zxingcpp module to work on the Amazon Linux server we're using. I solved it this way:

     yum groupinstall 'Development Tools'
     pip install --upgrade setuptools wheel
     yum install python3-devel.x86_64
     pip install zxing-cpp