I have written a class to retrieve creditor data such as the IBAN from an invoice PDF that has a QR code on it. It worked fine until we received a PDF that throws this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 157: invalid start byte
This happens when I try to process the PDF.
This is how I've done it:
doc = fitz.open(self.image_path)  # open document
i = 0
if not os.path.exists("./qr_codes/"):
    os.makedirs("./qr_codes/")
for page in doc:
    pix = page.get_pixmap(matrix=self.mat)  # render page to an image
    pix.save(f"./qr_codes/page_{i}.png")
    img = cv2.imread(f"./qr_codes/page_{i}.png")
    detect = cv2.QRCodeDetector()
    text, points, straight_qrcode = detect.detectAndDecode(img)
    if text:
        # try to find an IBAN in one of the lines
        self.iban = "\r\n".join([line for line in text.splitlines() if re.findall(r"CH\d{19}", line.strip())])
        # try to find the reference number by joining all lines and searching for CH QRR \d+
        # Also strip the "CH QRR" prefix, because only the number is needed for SAP
        ref_number = re.findall(r'CH\s*QRR\s*\d+|$', " ".join(text.splitlines()))
        self.ref_number = int(re.sub(r"\D", "", ref_number[0])) if ref_number else None
        self.__save_values()
        return True
    i += 1
return False
Is there a way to strip the bytes somehow?
I've also tried it via a NumPy array:
stream = open(f'./qr_codes/page_{i}.png', encoding="utf-8", errors="ignore")
stream = bytearray(stream.read(), encoding="utf-8")
detect = cv2.QRCodeDetector()
text, points, straight_qrcode = detect.detectAndDecode(numpy.asarray(stream, dtype=numpy.uint8))
# print(text)
But this way I only get an empty string back, so I guess I'm doing something wrong here. Could someone provide some ideas on how to solve the byte issue?
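A likely reason the NumPy attempt returns an empty string: a PNG is binary data, so opening it in text mode with errors="ignore" silently drops every byte that is not valid UTF-8, and the detector then receives a flat byte array rather than a decoded image. The sketch below shows the byte loss on a made-up byte string. `load_image_from_bytes` is a hypothetical helper name; it assumes `cv2` and `numpy` are installed and uses `cv2.imdecode` to turn the raw file bytes into an image. This only addresses the image loading and would not by itself avoid the UnicodeDecodeError, which is raised while the QR payload is decoded.

```python
import os
import tempfile

# The 8-byte PNG signature plus a few made-up bytes, standing in for a real PNG file.
SAMPLE = b"\x89PNG\r\n\x1a\n" + bytes([0x00, 0xFC, 0xFF, 0x41])


def read_as_text_then_bytes(path):
    """Mimic the attempt above: read in text mode, ignore errors, re-encode."""
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return bytes(fh.read(), encoding="utf-8")


def read_as_bytes(path):
    """Read the file unchanged, in binary mode."""
    with open(path, "rb") as fh:
        return fh.read()


def load_image_from_bytes(path):
    """Hypothetical helper: decode an image file from its raw bytes."""
    import cv2    # imported lazily so the demo below runs without OpenCV
    import numpy

    buf = numpy.frombuffer(read_as_bytes(path), dtype=numpy.uint8)
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)


def demo():
    """Write SAMPLE to a temporary file and read it back both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        with open(path, "wb") as fh:
            fh.write(SAMPLE)
        return read_as_bytes(path), read_as_text_then_bytes(path)
```

Calling `demo()` returns the original bytes and the text-mode version side by side; the second one is shorter because the invalid bytes were discarded.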
Edit: As requested, the full traceback:
Traceback (most recent call last):
  File "C:\Users\m7073\Repos\Chronos_New\invoice_extraction\qr_code_scan.py", line 128, in <module>
    qrcode.set_qr_values()
  File "C:\Users\m7073\Repos\Chronos_New\invoice_extraction\qr_code_scan.py", line 73, in set_qr_values
    text, points, straight_qrcode = detect.detectAndDecode(img)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 157: invalid start byte
Edit 2 (minimal reproducible example):
import cv2
img = cv2.imread(f"page_1.png")
detect = cv2.QRCodeDetector()
text, points, straight_qrcode = detect.detectAndDecode(img)
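For context on the error message: byte 0xfc cannot start a UTF-8 sequence, but in Latin-1 (ISO-8859-1) it is the letter "ü". One plausible explanation, which is an assumption since the QR code itself is not shown here, is that the payload contains an umlaut (for example in a name or address) encoded as Latin-1 rather than UTF-8. The sketch below reproduces the same kind of error with a made-up byte string and shows what the different decoding strategies return. As far as I can tell the decoding happens inside the OpenCV bindings, so these options cannot simply be passed to `detectAndDecode`.

```python
# Made-up payload fragment: "Müller" encoded as Latin-1, so "ü" is the single byte 0xfc.
payload = "Müller".encode("latin-1")


def try_utf8(data):
    """Decode strictly as UTF-8 and report the failure reason instead of raising."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


strict = try_utf8(payload)                            # fails on byte 0xfc
ignored = payload.decode("utf-8", errors="ignore")    # drops the offending byte
replaced = payload.decode("utf-8", errors="replace")  # inserts U+FFFD instead
latin1 = payload.decode("latin-1")                    # recovers the umlaut
```

Stripping the byte with errors="ignore" loses the character, whereas decoding with the encoding that was actually used keeps it.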
Late reply because I was ill for 3 weeks, but I've switched to zxingcpp and adjusted the code from:
for page in doc:
    pix = page.get_pixmap(matrix=self.mat)  # render page to an image
    pix.save(f"./qr_codes/page_{i}.png")
    img = cv2.imread(f"./qr_codes/page_{i}.png")
    detect = cv2.QRCodeDetector()
    text, points, straight_qrcode = detect.detectAndDecode(img)
    if text:
        # try to find an IBAN in one of the lines
        self.iban = "\r\n".join([line for line in text.splitlines() if re.findall(r"CH\d{19}", line.strip())])
        # try to find the reference number by joining all lines and searching for CH QRR \d+
        # Also strip the "CH QRR" prefix, because only the number is needed for SAP
        ref_number = re.findall(r'CH\s*QRR\s*\d+|$', " ".join(text.splitlines()))
        self.ref_number = int(re.sub(r"\D", "", ref_number[0])) if ref_number else None
        self.__save_values()
        return True
    i += 1
to:
for page in pdf:
    pil_image = page.render(scale=3).to_pil()
    pil_image.save(f"./qr_codes/page_{i}.png")
    img = cv2.imread(f"./qr_codes/page_{i}.png")
    for result in zxingcpp.read_barcodes(img):
        # try to find an IBAN in one of the lines
        self.iban = "\r\n".join([line for line in result.text.splitlines() if re.findall(r"CH\d{19}", line.strip())])
        # try to find the reference number by joining all lines and searching for CH QRR \d+
        # Also strip the "CH QRR" prefix, because only the number is needed for SAP
        ref_number = re.findall(r'CH\s*QRR\s*\d+|$', " ".join(result.text.splitlines()))
        self.ref_number = int(re.sub(r"\D", "", ref_number[0])) if ref_number else None
        self.__save_values()
        return True
    i += 1
And this way it works.
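The parsing step can be tried on its own, without a PDF or a barcode reader. The following is a minimal sketch: `SAMPLE_TEXT` is a made-up payload in the style of a Swiss QR bill (not taken from a real invoice), and `parse_qr_text` is a hypothetical helper name. It uses the same two patterns as the code above, but with `re.search` and a capture group, which is my own substitution: with the `|$` alternative, `findall` always returns at least an empty string, so `int("")` would raise a ValueError when no reference is present.

```python
import re

# Made-up payload in the style of a Swiss QR bill (illustration only).
SAMPLE_TEXT = "\r\n".join([
    "SPC", "0200", "1",
    "CH4431999123000889012",
    "S", "Example AG", "Musterstrasse", "1", "8000", "Zürich", "CH",
    "QRR", "210000000003139471430009017",
])

IBAN_RE = re.compile(r"CH\d{19}")
REF_RE = re.compile(r"CH\s*QRR\s*(\d+)")


def parse_qr_text(text):
    """Return (iban, ref_number) extracted from the decoded QR text."""
    lines = text.splitlines()
    # Keep every line that contains something shaped like a Swiss IBAN.
    iban = "\r\n".join(line for line in lines if IBAN_RE.search(line.strip()))
    # Join the lines so "CH" and "QRR" on separate lines can match together.
    match = REF_RE.search(" ".join(lines))
    ref_number = int(match.group(1)) if match else None
    return iban, ref_number
```

With a result from `zxingcpp.read_barcodes`, this would be called as `parse_qr_text(result.text)`.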
Simple example to get it going:
import cv2
import zxingcpp

img = cv2.imread("./qr_codes/page_1.png")
# iterate over the QR codes found in the image
for result in zxingcpp.read_barcodes(img):
    print(result.text)
Edit: I had trouble getting the zxingcpp module to work on the Amazon Linux server we're using. I solved it this way:
yum groupinstall 'Development Tools'
pip install --upgrade setuptools wheel
yum install python3-devel.x86_64
pip install zxing-cpp