To scrape a page with dynamically loaded content, requests-html provides a way to get the rendered page after its JavaScript has executed. However, when calling the arender() method on an AsyncHTMLSession response in a multithreaded implementation, the generated HTML doesn't change.
For example, at the URL given in the source code, the table cells are empty in the initial HTML. The script execution emulated by arender() is expected to insert the values into the markup, yet no changes are visible in the returned source.
from pprint import pprint
# from bs4 import BeautifulSoup
import asyncio
from timeit import default_timer
from concurrent.futures import ThreadPoolExecutor
from requests_html import AsyncHTMLSession, HTML


async def fetch(session, url):
    r = await session.get(url)
    await r.html.arender()
    return r.content


def parseWebpage(page):
    print(page)


async def get_data_asynchronous():
    urls = [
        'http://www.fpb.pt/fpb2014/!site.go?s=1&show=jog&id=258215'
    ]
    with ThreadPoolExecutor(max_workers=20) as executor:
        with AsyncHTMLSession() as session:
            # Set any session parameters here before calling `fetch`
            # Initialize the event loop
            loop = asyncio.get_event_loop()
            # Use list comprehension to create a list of
            # tasks to complete. The executor will run the `fetch`
            # function for each url in the urls list
            tasks = [
                await loop.run_in_executor(
                    executor,
                    fetch,
                    *(session, url)  # Allows us to pass in multiple arguments to `fetch`
                )
                for url in urls
            ]
            # Initializes the tasks to run and awaits their results
            for response in await asyncio.gather(*tasks):
                parseWebpage(response)


def main():
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(get_data_asynchronous())
    loop.run_until_complete(future)


main()
The source as it stands after the rendering method has run is not stored in the content attribute of the response, but in raw_html on its HTML object. In this case, fetch should return r.html.raw_html.
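A minimal sketch of that change, assuming the session object behaves like AsyncHTMLSession (an awaitable get() whose response exposes html.arender() and html.raw_html). The helper fetch_all is a hypothetical name that does not appear in the original code; it simply gathers the fetches on the running event loop instead of going through a ThreadPoolExecutor.

```python
import asyncio


async def fetch(session, url):
    """Return the page markup as it stands after JavaScript rendering."""
    r = await session.get(url)
    # arender() updates r.html in place. r.content still holds the bytes
    # that were originally downloaded, so it never reflects the rendering.
    await r.html.arender()
    return r.html.raw_html


async def fetch_all(session, urls):
    """Hypothetical helper: fetch and render every URL concurrently."""
    # gather() preserves the order of `urls` in the returned list.
    return await asyncio.gather(*(fetch(session, url) for url in urls))
```

With requests-html installed, one way to drive this would be session = AsyncHTMLSession() followed by pages = session.run(lambda: fetch_all(session, urls))[0], since run() expects callables that return coroutines. Note that raw_html is a bytes object; if a decoded string is more convenient, r.html.html should hold the same rendered markup as str.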