python selenium-webdriver urllib python-3.11 bytestream

How can I get a file-like object from Selenium without downloading the file to a local path?


I'm working on a parser platform. I need to download files and save them directly to an FTP server. For this I need a file-like object, because I don't want to leave junk temporary files on disk.

I need to use Selenium specifically.

For example, I need to download this document, but to do so I first have to enter some data and accept the agreement form.

This code accepts the notification, submits the form, and saves the cookies:

import os
import pickle
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

def get_file(driver: webdriver.Chrome, url: str):
    driver.set_page_load_timeout(40)
    driver.get(url=url)
    time.sleep(2)

    # Accept the notification once its button is clickable; waiting on the
    # locator avoids a NoSuchElementException if the banner renders late
    WebDriverWait(driver, 5).until(
        ec.element_to_be_clickable((By.ID, 'ccc-notify-accept'))).click()

    # Enter some data
    WebDriverWait(driver, 2).until(ec.presence_of_element_located((By.ID, 'agreement_form')))
    driver.find_element(By.ID, 'contact_name').send_keys('Company')
    driver.find_element(By.ID, 'contact_title').send_keys('People')
    driver.find_element(By.ID, 'company').send_keys('cb')
    driver.find_element(By.ID, 'country').send_keys('some')

    # Submit the form once its button is clickable
    submit_locator = (By.XPATH, '//*[@id="doc_agreement"]/div[4]/input[1]')
    WebDriverWait(driver, 5).until(ec.element_to_be_clickable(submit_locator)).click()

    time.sleep(2)

    # Save the cookies; the with block closes the file handle
    with open('cookies.pkl', 'wb') as cookie_file:
        pickle.dump(driver.get_cookies(), cookie_file)

    time.sleep(10)

On the web I only found ways to download a document via Selenium to a local directory. The setup below can only save the file to a local path (path_loc):


import os

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def downloadDriver():
    options = webdriver.ChromeOptions()
    # Chrome expects the window size as width,height
    options.add_argument('--window-size=1920,1080')
    options.add_argument('--disable-gpu')

    path_loc = os.path.join(os.getcwd(), "temp")
    chrome_prefs = {
        "download.prompt_for_download": False,
        "plugins.always_open_pdf_externally": True,
        "download.open_pdf_in_system_reader": False,
        "profile.default_content_settings.popups": 0,
        "download.default_directory": path_loc,
    }
    options.add_experimental_option("prefs", chrome_prefs)
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    return driver

I tried to get the file object via urllib.request.urlopen(), but it raises a 403 error. I also tried passing the cookies from Selenium to urllib, but that didn't solve the problem.

How can I get a stream, a file-like object, bytes, or anything similar?


Solution

    • first comment: Here is the direct link to your file: https://docs-prv.pcisecuritystandards.org/PCI%20DSS/Standard/PCI-DSS-v4_0.pdf. You don't need Selenium to download it, but you do need a cookie, otherwise you get a 403 Forbidden HTTP response. You can therefore use a module like requests and pass the cookie together with the URL (see example). You can even use curl directly on the command line (see example). If you want to generate the cookie without a browser and without any human intervention, look for the URL the form is posted to (learn how to use network logging in your browser).

    • second comment: If you really want to keep using Selenium because you need it to generate the cookie, you can use the io standard library module to create a file-like object (see "How to create in-memory file object").

    • third comment: What's wrong with temporary files? A temporary file is deleted as soon as it is closed, for example at the end of a with block (see tempfile.NamedTemporaryFile in the Python documentation). There is even a class that keeps the data in memory and only spills to disk when required: tempfile.SpooledTemporaryFile.
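
The three comments can be combined into one rough sketch: let Selenium pass the form, reuse its cookies for a plain HTTP request, and stream the response into a file-like object that never has to be a named file on disk. The sketch uses urllib.request from the standard library instead of requests, only to stay dependency-free. The helper names (cookie_header, build_request, fetch_to_file_like, upload_to_ftp) and the 10 MB memory limit are made up for illustration. Sending the browser's User-Agent along with the cookies is an assumption: some servers reject urllib's default one, but whether that caused the 403 in the question is not confirmed.

```python
import shutil
import tempfile
import urllib.request


def cookie_header(selenium_cookies):
    """Turn the list of dicts from driver.get_cookies() into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


def build_request(url, selenium_cookies, user_agent):
    """Build a request carrying the browser's cookies and User-Agent (hypothetical helper)."""
    headers = {"Cookie": cookie_header(selenium_cookies), "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


def fetch_to_file_like(url, selenium_cookies, user_agent, max_memory=10 * 1024 * 1024):
    """Download url into a file-like object held in memory up to max_memory bytes."""
    # SpooledTemporaryFile only spills to an anonymous temp file past max_size
    buffer = tempfile.SpooledTemporaryFile(max_size=max_memory)
    try:
        request = build_request(url, selenium_cookies, user_agent)
        with urllib.request.urlopen(request, timeout=60) as response:
            shutil.copyfileobj(response, buffer)  # stream in chunks
    except Exception:
        buffer.close()
        raise
    buffer.seek(0)  # rewind so the consumer reads from the start
    return buffer


def upload_to_ftp(ftp, remote_name, file_obj):
    """Send a file-like object through an already connected ftplib.FTP instance."""
    file_obj.seek(0)
    ftp.storbinary(f"STOR {remote_name}", file_obj)
```

With a driver that has already passed the form, the user agent can be read with driver.execute_script("return navigator.userAgent") and the cookies with driver.get_cookies(). Both go into fetch_to_file_like together with the direct file URL, and the returned object can be handed to upload_to_ftp inside a with block so it is cleaned up afterwards. If the file is known to be small, io.BytesIO(response.read()) is a simpler in-memory alternative.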