I am trying to write Python code that automatically downloads images from web pages.
My approach is to open a specific page with Selenium and grab the image response, the same data I can copy from the Network tab of Chrome DevTools.
I need this because some sites are protected by Cloudflare, and common methods such as requests or urllib.request return 403 errors.
I can get the image data manually through 'Copy response', as in the screenshot, but I want to get it with Python and the Chrome webdriver.
[Screenshot: Copy response in Chrome DevTools]
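from selenium import webdriver
import time

option = webdriver.ChromeOptions()
# Enable performance logging so network events are recorded
option.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
# Attach to a Chrome instance already running with remote debugging on port 9222
option.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
browser = webdriver.Chrome(options=option)
browser.get(url)  # url is the page to load
time.sleep(5)  # give the page time to issue its network requests
log_entries = browser.get_log("performance")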
The code above gives me the response headers, but I want the full response body of the images.
To get the responses, loop through the logs and keep only the message objects that contain the event Network.responseReceived.
Then take the params object from each message and check whether target_url_part is present in its response url.
Once you have a match, execute the CDP command Network.getResponseBody with the requestId from params.
Depending on the response body, you can then take further actions, such as reading its JSON fields or converting it into an image.
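As a minimal sketch of the "converting it into an image" step: Network.getResponseBody returns a dictionary with a body string and a base64Encoded flag, and binary content such as images normally arrives base64-encoded. The helper names decode_cdp_body and save_response_body, and the sample payload, are assumptions made for illustration only.

```python
import base64


def decode_cdp_body(result: dict) -> bytes:
    """Turn the dict returned by Network.getResponseBody into raw bytes."""
    body = result.get("body", "")
    if result.get("base64Encoded"):
        # Binary payloads (images, fonts, ...) are sent base64-encoded.
        return base64.b64decode(body)
    # Text payloads (HTML, JSON, SVG, ...) come through as a plain string.
    return body.encode("utf-8")


def save_response_body(result: dict, path: str) -> None:
    """Decode the response body and write it to a file (hypothetical helper)."""
    with open(path, "wb") as fh:
        fh.write(decode_cdp_body(result))


# Stand-in for the value returned by browser.execute_cdp_cmd(...);
# the payload is only the 8-byte PNG signature, not a real image.
sample_result = {
    "body": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii"),
    "base64Encoded": True,
}
image_bytes = decode_cdp_body(sample_result)
```

With a live session, the dictionary returned by execute_cdp_cmd would be passed in place of sample_result, for example save_response_body(body, "image.png"), where the file name is up to you.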
Similar question answer reference
from selenium import webdriver
import json
import time

option = webdriver.ChromeOptions()
option.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
option.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
browser = webdriver.Chrome(options=option)

url = 'site_url'
browser.get(url)
time.sleep(5)
# Read the log only after the page has loaded, so it contains the page's network events
log_entries = browser.get_log("performance")

target_url = "your_request_url_part"
for log in log_entries:
    message = log["message"]
    # Skip log entries for other events
    if "Network.responseReceived" not in message:
        continue
    params = json.loads(message)["message"].get("params")
    if params is None:
        continue
    response = params.get("response")
    if response is None or target_url not in response["url"]:
        continue
    # Returns a dict with 'body' and 'base64Encoded' keys
    body = browser.execute_cdp_cmd('Network.getResponseBody', {'requestId': params["requestId"]})
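If you want every image on the page rather than one known URL, the filtering in the loop above could be pulled into a small pure function that matches on the response's mimeType field instead. This is only a sketch: the function name image_request_ids is hypothetical, and the log entry below is synthetic, shaped like one item from get_log("performance") with a made-up request ID and URL.

```python
import json


def image_request_ids(log_entries):
    """Return (requestId, url) pairs for responses whose MIME type is an image."""
    matches = []
    for entry in log_entries:
        # Each entry's "message" is a JSON string wrapping the CDP event.
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue
        params = message.get("params") or {}
        response = params.get("response") or {}
        if response.get("mimeType", "").startswith("image/"):
            matches.append((params["requestId"], response["url"]))
    return matches


# Synthetic entry for illustration; values are made up.
fake_entry = {
    "message": json.dumps({
        "message": {
            "method": "Network.responseReceived",
            "params": {
                "requestId": "1000.1",
                "response": {
                    "url": "https://example.com/a.png",
                    "mimeType": "image/png",
                },
            },
        }
    })
}
print(image_request_ids([fake_entry]))
```

Each returned requestId can then be passed to Network.getResponseBody exactly as in the loop above.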