I have a Python web scraper that crawls through a series of TIFF pages online and converts each one to PDF, but I can't figure out how to combine all the converted PDFs into one file and write it to my computer.
import img2pdf, requests
outPDF = []
for pgNum in range(1,20):
    tiff = requests.get("http://url-to-tiff-file.com/page="+str(pgNum)).content
    pdf = img2pdf.convert(tiff)
    outPDF.append(pdf)
with open("file","wb") as f:
    f.write(''.join(outPDF))
I get the following error when I run it:
f.write(''.join(outPDF))
TypeError: sequence item 0: expected str instance, bytes found
Update
If you go to http://oris.co.palm-beach.fl.us/or_web1/details_img.asp?doc_id=23543456&pg_num=1, then open up a web dev console in your browser, you can see a form tag with a bunch of ".tif" URLs in a bunch of hidden input tags.
img2pdf has some quirks when it comes to converting TIFF and PNG files. The code below avoids some of the potential issues in your code, because it uses Pillow to reformat the image files before they are passed to img2pdf.
import img2pdf
from PIL import Image

image_list = []
test_images = ['image_01.tiff', 'image_02.tiff', 'image_03.tiff']
for image in test_images:
    # Re-save each image as RGB with Pillow before handing it to img2pdf
    im = Image.open(image).convert('RGB')
    im.save(f'mod_{image}')
    image_list.append(f'mod_{image}')

with open('test.pdf', 'wb') as f:
    # US Letter page size (8.5 x 11 inches), expressed in PDF points
    letter = (img2pdf.in_to_pt(8.5), img2pdf.in_to_pt(11))
    layout = img2pdf.get_layout_fun(letter)
    # Passing the whole list produces a single multi-page PDF
    f.write(img2pdf.convert(image_list, layout_fun=layout))
I modified your code to use the approach above, but I cannot test it because I don't know which website you're querying. Please let me know if something fails and I will troubleshoot it.
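import img2pdf
import requests
from PIL import Image
from io import BytesIO

outPDF = []
for pgNum in range(1,20):
    tiff = requests.get("http://url-to-tiff-file.com/page="+str(pgNum)).content
    # Open the downloaded bytes with Pillow and normalize to RGB
    im = Image.open(BytesIO(tiff)).convert('RGB')
    # Re-encode in memory; im.save() needs a file-like object, not raw bytes
    buf = BytesIO()
    im.save(buf, format='TIFF')
    outPDF.append(buf.getvalue())

with open("file.pdf","wb") as f:
    letter = (img2pdf.in_to_pt(8.5), img2pdf.in_to_pt(11))
    layout = img2pdf.get_layout_fun(letter)
    # One convert() call over the whole list writes one page per image
    f.write(img2pdf.convert(outPDF, layout_fun=layout))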
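For reference, the TypeError in your original code is a str/bytes mismatch: img2pdf.convert() returns bytes, and ''.join() only accepts str items. The sketch below reproduces that with placeholder byte strings (they are not real PDFs). Switching to b''.join() would silence the error, but as far as I know it would not give you a valid merged PDF, since each chunk is a complete document on its own. That is why the code above hands all the images to a single img2pdf.convert() call instead.

```python
# Placeholder byte strings standing in for the output of img2pdf.convert()
chunks = [b"%PDF-1.3 page one", b"%PDF-1.3 page two"]

try:
    "".join(chunks)  # str.join() rejects bytes items
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# bytes.join() accepts bytes, but the result is just two files glued
# together, not a merged multi-page PDF
glued = b"".join(chunks)
print(glued.count(b"%PDF"))
```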
UPDATED ANSWER
After you provided the actual URL for the target website, I determined that a different route is the better way to accomplish your task. Based on your use case, what you want is the PDF file that the site produces from all the hidden TIFF files. The source website will generate that PDF for you, so there is no need to download the individual TIFF files.
Here is the code to get that generated PDF and download it to your system.
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

chrome_options = Options()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-popup-blocking")

# Download PDFs to this folder instead of opening them in Chrome's viewer
download_directory = os.path.abspath('chrome_pdf_downloads')
prefs = {"download.default_directory": download_directory,
         "download.prompt_for_download": False,
         "download.directory_upgrade": True,
         "plugins.always_open_pdf_externally": True}
chrome_options.add_experimental_option('prefs', prefs)

# Selenium 3 style call; newer Selenium releases take the driver path via a Service object
driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=chrome_options)
url_main = 'http://oris.co.palm-beach.fl.us/or_web1/details_img.asp?doc_id=23543456&pg_num=1'
driver.get(url_main)

# Wait up to 60 seconds for the submit button to appear, then submit its form
button_xpath = "//input[@name='button' and @onclick='javascript:ValidateAndSubmit(this.form)']"
button = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.XPATH, button_xpath)))
button.submit()
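One thing the script does not do is wait for the download to finish before it exits. Here is a minimal sketch of a polling helper that could cover that. It assumes Chrome's usual behavior of giving in-progress downloads a .crdownload suffix, and that the download folder holds no older PDFs. The name wait_for_pdf is hypothetical and not part of Selenium.

```python
import time
from pathlib import Path


def wait_for_pdf(directory, timeout=60.0, poll=0.5):
    """Return the first finished .pdf found in `directory`, or None on timeout.

    Hypothetical helper: assumes in-progress Chrome downloads carry a
    .crdownload suffix and that the folder holds no older PDFs.
    """
    folder = Path(directory)
    deadline = time.monotonic() + timeout
    while True:
        if folder.is_dir():
            in_progress = list(folder.glob("*.crdownload"))
            finished = sorted(folder.glob("*.pdf"))
            # Done only when a PDF exists and nothing is still downloading
            if finished and not in_progress:
                return finished[0]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

It could be called as wait_for_pdf(download_directory) right after the form is submitted, followed by driver.quit().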
If you still want to get the TIFF files, please let me know and I will look into downloading and processing them to produce the PDF file that the code above is obtaining.
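If the TIFF route is still of interest, a starting point would be collecting the ".tif" URLs from the hidden input tags mentioned in the question. Below is a minimal sketch using only the standard library. It assumes the URLs sit in the value attribute of input tags with type="hidden", which I have not verified against the live page, and the sample HTML (including the names and paths in it) is made up for illustration.

```python
from html.parser import HTMLParser


class TiffInputParser(HTMLParser):
    """Collect value attributes of hidden <input> tags that end in .tif/.tiff."""

    def __init__(self):
        super().__init__()
        self.tiff_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        value = attributes.get("value") or ""
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and value.lower().endswith((".tif", ".tiff")):
            self.tiff_urls.append(value)


# Made-up sample markup; the real page's structure is an assumption
sample_html = """
<form name="sampleform">
  <input type="hidden" name="img1" value="/tmp/page_0001.tif">
  <input type="hidden" name="img2" value="/tmp/page_0002.tif">
  <input type="hidden" name="doc_id" value="23543456">
  <input type="button" name="button" value="Submit">
</form>
"""

parser = TiffInputParser()
parser.feed(sample_html)
print(parser.tiff_urls)
```

In a real run, the HTML would come from requests.get(...).text or driver.page_source, and each collected URL could then be downloaded and passed through the Pillow and img2pdf code shown earlier.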