I have the following code:
from requests_html import HTMLSession

ses = HTMLSession()
r = ses.get(MYURL)             # plain GET of MYURL
r.html.render(keep_page=True)  # 'render' the page in a headless chrome browser,
                               # i.e. like a real browser, trigger all requests
                               # for dependent links (.css, .js, .jpg, .gif, ...)
Calling render() triggers a load of requests for JavaScript, bitmaps, etc. Is there any way I can get a trace of the status codes for each request? I'm mostly interested in 404s, but 403 and 5xx errors might be interesting as well.
One use case would, for example, be:
• Go to a page or a sequence of pages.
• Then report how many requests failed and which URLs were accessed.
If this is not possible with requests-html but reasonably simple with selenium, I can switch to selenium.
Addendum: ugly workaround 1:
I can set up logging to log into a file and set the log level to DEBUG. Then I can try to parse the logs of websockets.protocol, which contain strings like
{\"url\":\"https://my.server/example.gif\",\"status\":404,\"statusText\":\"...
Issues:
Activating log level DEBUG into a file seems to activate something else, because suddenly loads of debug info are also logged to stdout. For example:
[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:45945/devtools/browser/bc5ce097-e67d-455e-8a59-9a4c213263c1
[D:pyppeteer.connection.Connection] SEND: {"id": 1, "method": "Target.setDiscoverTargets", "params": {"discover": true}}
Also, it's not really fun to parse this in real time and to correlate it with the URL I used in my code.
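For reference, a minimal sketch of this setup (the websockets.protocol logger name is taken from the excerpt above; whether its DEBUG output actually contains the response lines depends on the websockets and pyppeteer versions):

import logging

# Write DEBUG output of the websockets protocol logger to a file only,
# instead of raising the root logger's level (which is what floods stdout).
handler = logging.FileHandler("render_trace.log")
handler.setLevel(logging.DEBUG)

ws_logger = logging.getLogger("websockets.protocol")
ws_logger.setLevel(logging.DEBUG)
ws_logger.addHandler(handler)
ws_logger.propagate = False  # keep the noise away from the root logger / stdout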
Addendum: ugly workaround 2:
Even worse for correlating, but nicer for parsing and for just identifying 404s, and it only works if I am in control of the HTTP server.
Parsing the logs of the HTTP server (nginx), I can even set up a custom log format in CSV with just the data I'm interested in.
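A sketch of the server-side counting, assuming a hypothetical access.csv whose nginx log_format writes just url,status per line (adjust to whatever format is actually configured):

import csv
from collections import Counter

def failed_requests(logfile="access.csv"):
    # Count (url, status) pairs with a 4xx/5xx status.
    failures = Counter()
    with open(logfile, newline="") as f:
        for url, status in csv.reader(f):
            if status.startswith(("4", "5")):
                failures[(url, status)] += 1
    return failures

for (url, status), count in failed_requests().items():
    print("{0} {1} ({2}x)".format(status, url, count))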
Addendum: ugly workaround 3:
Using Python logging (a dedicated handler and filter for pyppeteer), I can intercept a JSON string describing the responses from the pyppeteer.connection.CDPSession logger without polluting stderr.
The filter allows me to retrieve the data in real time.
This is still quite hackish, so I'm looking for a better solution.
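Roughly, the handler/filter combination looks like this (a sketch only: the logger name comes from the pyppeteer output above, and the assumption that the record message carries a JSON blob with a Network.responseReceived CDP event may not hold across pyppeteer versions):

import json
import logging

captured = []  # (url, status) tuples, available in real time

class ResponseFilter(logging.Filter):
    def filter(self, record):
        # Try to pull a CDP "Network.responseReceived" event out of the
        # debug message (assumed to contain the raw JSON payload).
        msg = record.getMessage()
        start = msg.find("{")
        if start == -1:
            return False
        try:
            payload = json.loads(msg[start:])
        except ValueError:
            return False
        if payload.get("method") == "Network.responseReceived":
            response = payload["params"]["response"]
            captured.append((response["url"], response["status"]))
        return False  # drop the record, we only wanted to peek at it

cdp_logger = logging.getLogger("pyppeteer.connection.CDPSession")
cdp_logger.setLevel(logging.DEBUG)
cdp_logger.addFilter(ResponseFilter())
cdp_logger.addHandler(logging.NullHandler())  # no other output for this logger
cdp_logger.propagate = False                  # keep it off stdout/stderr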
Give the following a try and see if it's what you're after. It's strictly a pyppeteer version (rather than requests_html) and relies on unexposed private variables so is fairly susceptible to breakage with version updates.
import asyncio
from pyppeteer import launch
from pyppeteer.network_manager import NetworkManager


def logit(event):
    # event is a pyppeteer Response; _request and _status are private attributes
    req = event._request
    print("{0} - {1}".format(req.url, event._status))


async def main():
    browser = await launch({"headless": False})
    page = await browser.newPage()
    # hook into the page's (private) network manager to see every response
    page._networkManager.on(NetworkManager.Events.Response, logit)
    await page.goto('https://www.google.com')
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())
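If you want to stay away from the private _networkManager attribute, the same information should also be reachable through the page's public 'response' event; a sketch, assuming pyppeteer mirrors puppeteer's page.on('response', ...) API and the response object's public url/status properties:

import asyncio
from pyppeteer import launch

def logit(response):
    print("{0} - {1}".format(response.url, response.status))

async def main():
    browser = await launch({"headless": False})
    page = await browser.newPage()
    page.on('response', logit)  # fires once per finished response
    await page.goto('https://www.google.com')
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())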
Checking the source of requests_html, the browser page object seems to be buried pretty deep, so getting at the NetworkManager isn't exactly straightforward. If you really want it working from within requests_html, it's probably easiest to monkeypatch. Here's an example:
import asyncio
from requests_html import HTMLSession, TimeoutError, HTML
from pyppeteer.network_manager import NetworkManager
from typing import Optional, Union


def logit(event):
    req = event._request
    print("{0} - {1}".format(req.url, event._status))


async def _async_render(self, *, url: str, script: str = None, scrolldown, sleep: int, wait: float, reload, content: Optional[str], timeout: Union[float, int], keep_page: bool, cookies: list = [{}]):
    """ Handle page creation and js rendering. Internal use for render/arender methods. """
    try:
        page = await self.browser.newPage()
        # This hook is the addition to the stock requests_html _async_render:
        # subscribe to response events before navigating.
        page._networkManager.on(NetworkManager.Events.Response, logit)

        # Wait before rendering the page, to prevent timeouts.
        await asyncio.sleep(wait)

        if cookies:
            for cookie in cookies:
                if cookie:
                    await page.setCookie(cookie)

        # Load the given page (GET request, obviously.)
        if reload:
            await page.goto(url, options={'timeout': int(timeout * 1000)})
        else:
            await page.goto(f'data:text/html,{self.html}', options={'timeout': int(timeout * 1000)})

        result = None
        if script:
            result = await page.evaluate(script)

        if scrolldown:
            for _ in range(scrolldown):
                await page._keyboard.down('PageDown')
                await asyncio.sleep(sleep)
        else:
            await asyncio.sleep(sleep)

        if scrolldown:
            await page._keyboard.up('PageDown')

        # Return the content of the page, JavaScript evaluated.
        content = await page.content()
        if not keep_page:
            await page.close()
            page = None
        return content, result, page

    except TimeoutError:
        await page.close()
        page = None
        return None


ses = HTMLSession()
r = ses.get('https://www.google.com')  # plain GET; chromium only starts on render()
html = r.html
html._async_render = _async_render.__get__(html, HTML)  # bind the patched coroutine to this HTML instance
html.render()
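To cover the use case from the question (count failures and list the URLs that were accessed), the logit callback can collect into a list instead of printing; a small sketch:

results = []  # (url, status) per response seen during render()

def logit(event):
    results.append((event._request.url, event._status))

# ... render one or more pages as above, then:
failed = [(url, status) for url, status in results if status >= 400]  # assumes _status is an int
print("{0} of {1} requests failed".format(len(failed), len(results)))
for url, status in failed:
    print("{0} {1}".format(status, url))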