I need to convert many JPG/PNG files to DOCX files and then to PDF. My only goal is to get the text from each image into a PDF file; if I need to edit any text manually, I can do that in Word and then save it to the corresponding PDF file.
I tried using an API, but it failed because the extracted text did not match the text in the image.
My image files contain only text and nothing else.
I already have DOCX-to-PDF conversion code in Python:
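from docx2pdf import convert
# Named input_path/output_path so the built-in input() is not shadowed
input_path = 'INPUT_FILE_NAME.docx'
output_path = 'OUTPUT_FILE_NAME.pdf'
# Write the PDF next to the .docx file, using the same base name
convert(input_path)
# Write the PDF to an explicit output path
convert(input_path, output_path)
# If "Output" is a folder, convert every .docx file inside it
convert("Output")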
Please suggest how I can convert a PNG/JPG file to DOCX. Thanks.
EDIT --------------
I've got the following code running and have uploaded it to my GitHub repo.
from PIL import Image
from pytesseract import pytesseract
#Define path to tesseract.exe
path_to_tesseract = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
#Define path to image
path_to_image = 'texttoimage.png'
#Point tesseract_cmd to tesseract.exe
pytesseract.tesseract_cmd = path_to_tesseract
#Open image with PIL
img = Image.open(path_to_image)
#Extract text from image
text = pytesseract.image_to_string(img)
print(text)
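The code above only prints the recognized text. One possible way to get from there to a DOCX file is sketched below. It is a minimal sketch, not a tested pipeline: it assumes the python-docx package is installed (imported as `docx`), and the helper names `split_paragraphs`, `image_to_docx` and `convert_folder` are hypothetical names chosen for illustration. The paragraph splitting (blank line = new paragraph, wrapped lines joined with a space) is also an assumption about how the OCR output should be laid out.

```python
from pathlib import Path


def split_paragraphs(text):
    """Split OCR output into paragraphs on blank lines, joining wrapped lines."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line closes the paragraph collected so far
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def image_to_docx(image_path, docx_path):
    """OCR one image and save the recognized text as a .docx file."""
    # Third-party imports are kept local so split_paragraphs() can be used
    # without Pillow, pytesseract or python-docx installed.
    from PIL import Image
    from pytesseract import pytesseract
    from docx import Document  # provided by the python-docx package

    text = pytesseract.image_to_string(Image.open(image_path))
    document = Document()
    for paragraph in split_paragraphs(text):
        document.add_paragraph(paragraph)
    document.save(str(docx_path))


def convert_folder(image_dir, docx_dir):
    """Hypothetical batch helper: one .docx per .png/.jpg/.jpeg in image_dir."""
    docx_dir = Path(docx_dir)
    docx_dir.mkdir(parents=True, exist_ok=True)
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() in {".png", ".jpg", ".jpeg"}:
            image_to_docx(image_path, docx_dir / (image_path.stem + ".docx"))
```

With this, something like `convert_folder("images", "Output")` would fill the `Output` folder with DOCX files (the folder name `images` is just an example). Any manual corrections can then be made in Word before running `convert("Output")` from docx2pdf to produce the PDFs. On Windows, `pytesseract.tesseract_cmd` probably still needs to be pointed at tesseract.exe first, as in the code above.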