
Converting a PDF file to Excel with images in the table using Python


The PDF files that I need to convert have images in their tables. I want to convert the text as well as the images in the PDF tables to Excel. Please suggest suitable libraries for this.


Solution

  • You can use pikepdf to extract the images from the PDF:

    from pikepdf import Pdf, PdfImage
    
    filename = "sample.pdf"
    example = Pdf.open(filename)
    
    # Walk every page and save each embedded image to disk
    for i, page in enumerate(example.pages):
        for j, (name, raw_image) in enumerate(page.images.items()):
            image = PdfImage(raw_image)
            # extract_to chooses the file extension and returns the full output path
            out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")
    

    After extracting the images, you can run OCR on each one to turn an image of a table back into table data.
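
    The OCR step could look roughly like the sketch below. It is not part of the original answer: it assumes pytesseract (with the Tesseract binary installed), Pillow and openpyxl are available, and the function names `ocr_text_to_rows`, `image_to_rows` and `rows_to_xlsx` are made up for illustration. It also assumes that columns in the OCR output are separated by two or more spaces, which will not hold for every table.

```python
import re


def ocr_text_to_rows(text):
    """Split OCR output into rows of cells.

    Assumption: cells on a line are separated by runs of 2+ whitespace
    characters. Real OCR output is often messier than this.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(re.split(r"\s{2,}", line))
    return rows


def image_to_rows(image_path):
    """Run OCR on one extracted image and return its rows (hypothetical helper)."""
    # Imported lazily so the parsing helper above works without these packages
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(Image.open(image_path))
    return ocr_text_to_rows(text)


def rows_to_xlsx(rows, xlsx_path):
    """Write the rows to the first sheet of a new workbook (hypothetical helper)."""
    from openpyxl import Workbook

    workbook = Workbook()
    sheet = workbook.active
    for row in rows:
        sheet.append(row)  # one list per spreadsheet row
    workbook.save(xlsx_path)
```

    For example, `rows_to_xlsx(image_to_rows(out), "table.xlsx")` would take the path returned by `extract_to` above and write the recognized cells to a workbook. OCR quality depends heavily on image resolution, so check the result by hand.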