Tags: python, python-3.x, pdf, tiff

Combine a bunch of PDFs converted from TIFF files as they're read in thru a loop


I've got a Python web scraper that crawls thru a bunch of TIFF pages online and converts each to PDF, but I can't figure out how to combine all the converted PDFs into one file and write it to my computer.

import img2pdf, requests
outPDF = []

for pgNum in range(1,20):
    tiff = requests.get("http://url-to-tiff-file.com/page="+str(pgNum)).content
    pdf = img2pdf.convert(tiff)
    outPDF.append(pdf)

with open("file","wb") as f:
    f.write(''.join(outPDF))

I get the following error when I run it:

f.write(''.join(outPDF))
TypeError: sequence item 0: expected str instance, bytes found

Update

If you go to http://oris.co.palm-beach.fl.us/or_web1/details_img.asp?doc_id=23543456&pg_num=1, then open up a web dev console in your browser, you can see a form tag with a bunch of ".tif" URLs in a bunch of hidden input tags.


Solution

  • img2pdf has some quirks when it comes to converting TIFF and PNG files. The code below works around some of the potential issues in your code by using Pillow to re-save the image files before they are passed to img2pdf.

    import img2pdf
    from PIL import Image
    
    image_list = []
    test_images = ['image_01.tiff', 'image_02.tiff', 'image_03.tiff']
    for image in test_images:
        # Re-save each image as RGB so img2pdf receives a file it can process
        im = Image.open(image).convert('RGB')
        im.save(f'mod_{image}')
        image_list.append(f'mod_{image}')
    
    with open('test.pdf', 'wb') as f:
        # Lay every page out on US Letter (8.5 x 11 inches)
        letter = (img2pdf.in_to_pt(8.5), img2pdf.in_to_pt(11))
        layout = img2pdf.get_layout_fun(letter)
        # A single convert() call turns the whole list into one multi-page PDF
        f.write(img2pdf.convert(image_list, layout_fun=layout))
    

    I modified your code to use the approach above, but I could not test it because I don't know which website you're querying. Please let me know if something fails and I will troubleshoot it.

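    import img2pdf
    import requests
    from PIL import Image
    from io import BytesIO
    
    outPDF = []
    
    for pgNum in range(1,20):
        tiff = requests.get("http://url-to-tiff-file.com/page="+str(pgNum)).content
        # Decode the downloaded bytes with Pillow and normalize to RGB
        im = Image.open(BytesIO(tiff)).convert('RGB')
        # Re-encode in memory; PNG is an assumed choice of lossless format
        buf = BytesIO()
        im.save(buf, format='PNG')
        outPDF.append(buf.getvalue())
    
    with open("file.pdf","wb") as f:
        letter = (img2pdf.in_to_pt(8.5), img2pdf.in_to_pt(11))
        layout = img2pdf.get_layout_fun(letter)
        # Passing the whole list produces one PDF with one page per image
        f.write(img2pdf.convert(outPDF, layout_fun=layout))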
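
As an aside on the original error: `''.join(outPDF)` fails because `img2pdf.convert` returns `bytes`, and a `str` separator cannot join `bytes` items. The sketch below reproduces that with stand-in byte strings (the page contents are made up for illustration). Switching to `b''.join` would make the error go away, but it only places separate PDF files end to end, which PDF readers generally do not treat as one multi-page document. That is why the images are handed to a single `img2pdf.convert` call instead.

```python
# Stand-in data: made-up byte strings playing the role of converted PDF pages.
pages = [b"%PDF page one", b"%PDF page two"]


def join_as_text(chunks):
    """Reproduce the question's approach: join bytes with a str separator."""
    try:
        return "".join(chunks)
    except TypeError as exc:
        return f"TypeError: {exc}"


def join_as_bytes(chunks):
    """Join with a bytes separator; this runs, but it only concatenates."""
    return b"".join(chunks)


print(join_as_text(pages))
print(join_as_bytes(pages))
```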
    

    UPDATED ANSWER

    After you provided the actual URL for the target website, I determined that a different route is the better way to accomplish your task. What you ultimately want is the PDF assembled from all the hidden TIFF files, and the source website can generate that PDF itself, so there is no need to download the TIFF files individually.

    Here is the code to get that generated PDF and download it to your system.

    import os
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    
    # Note: written against the Selenium 4 API (Service, By, explicit waits)
    chrome_options = Options()
    chrome_options.add_argument("--incognito")
    chrome_options.add_argument("--disable-infobars")
    chrome_options.add_argument("start-maximized")
    chrome_options.add_argument("--disable-extensions")
    chrome_options.add_argument("--disable-popup-blocking")
    
    download_directory = os.path.abspath('chrome_pdf_downloads')
    
    # Save PDFs straight to download_directory instead of opening them in Chrome's viewer
    prefs = {"download.default_directory": download_directory,
         "download.prompt_for_download": False,
         "download.directory_upgrade": True,
         "plugins.always_open_pdf_externally": True}
    
    chrome_options.add_experimental_option('prefs', prefs)
    driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'), options=chrome_options)
    
    url_main = 'http://oris.co.palm-beach.fl.us/or_web1/details_img.asp?doc_id=23543456&pg_num=1'
    
    driver.get(url_main)
    # Wait up to 60 seconds for the submit button to appear, then submit its form
    button_xpath = "//input[@name='button' and @onclick='javascript:ValidateAndSubmit(this.form)']"
    button = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.XPATH, button_xpath)))
    button.submit()
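
The script above stops right after submitting the form, so nothing in it confirms that the PDF actually arrived. A minimal sketch of a polling helper is shown below. The name `wait_for_download` and its parameters are hypothetical, and it relies on Chrome writing in-progress downloads with a `.crdownload` suffix.

```python
import os
import time


def wait_for_download(directory, timeout=60.0, poll=0.5):
    """Return paths of the .pdf files in `directory` once no download is pending.

    The download counts as finished when at least one .pdf file exists and
    no .crdownload (in-progress) file remains. Raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory) if os.path.isdir(directory) else []
        pdfs = sorted(n for n in names if n.lower().endswith(".pdf"))
        partial = [n for n in names if n.endswith(".crdownload")]
        if pdfs and not partial:
            return [os.path.join(directory, n) for n in pdfs]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no finished PDF in {directory!r} after {timeout}s")
        time.sleep(poll)
```

With the script above, this could be called as `wait_for_download(download_directory)` after the form is submitted and before the browser is closed.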
    

    If you still want to get the TIFF files, please let me know and I will look into downloading and processing them to produce the PDF file that the code above is obtaining.
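
If the TIFF route is still of interest, a starting point would be collecting the ".tif" URLs from the hidden input tags mentioned in the question. The sketch below uses only the standard library. The sample HTML, the input names, and the `example.com` base URL are made up for illustration, so the real page may need a different filter.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical markup imitating a form with .tif paths in hidden inputs.
SAMPLE_HTML = """
<form name="frmImages">
  <input type="hidden" name="img1" value="/tmp/doc_0001.tif">
  <input type="hidden" name="img2" value="/tmp/doc_0002.tif">
  <input type="hidden" name="doc_id" value="23543456">
</form>
"""


class TiffInputParser(HTMLParser):
    """Collect the value of every hidden <input> whose value ends in .tif."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        input_type = (attr.get("type") or "").lower()
        value = attr.get("value") or ""
        if input_type == "hidden" and value.lower().endswith(".tif"):
            # Resolve relative paths against the page URL
            self.urls.append(urljoin(self.base_url, value))


def extract_tiff_urls(html, base_url):
    """Return absolute URLs for all hidden .tif inputs found in `html`."""
    parser = TiffInputParser(base_url)
    parser.feed(html)
    return parser.urls


urls = extract_tiff_urls(SAMPLE_HTML, "http://example.com/or_web1/details_img.asp")
print(urls)
```

Each URL could then be fetched with `requests.get(...).content` and fed through the Pillow and img2pdf steps shown earlier.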