Tags: python, multithreading, web-scraping, python-requests-html

How to render asynchronous page with requests-html in a multithreaded environment?


To build a scraper for a page with dynamically loaded content, requests-html provides a way to get the rendered page after the JavaScript has executed. However, when trying to use AsyncHTMLSession by calling the arender() method in a multithreaded implementation, the generated HTML doesn't change.

For example, at the URL used in the source code, the table's HTML values are empty by default. After script execution, which the arender() method is supposed to emulate, the values should be inserted into the markup, yet no visible change appears in the returned source.

from pprint import pprint

#from bs4 import BeautifulSoup
import asyncio
from timeit import default_timer
from concurrent.futures import ThreadPoolExecutor

from requests_html import AsyncHTMLSession, HTML

async def fetch(session, url):
    r = await session.get(url)
    await r.html.arender()
    return r.content

def parseWebpage(page):
    print(page)

async def get_data_asynchronous():  
    urls = [
        'http://www.fpb.pt/fpb2014/!site.go?s=1&show=jog&id=258215'
    ]  

    with ThreadPoolExecutor(max_workers=20) as executor:
        with AsyncHTMLSession() as session:
            # Set any session parameters here before calling `fetch` 

            # Get the running event loop
            loop = asyncio.get_event_loop()

            # Use a list comprehension to build the list of tasks.
            # The executor will run the `fetch` function for each
            # url in the urls list
            tasks = [
                await loop.run_in_executor(
                    executor,
                    fetch,
                    *(session, url) # Allows us to pass in multiple arguments to `fetch`
                )
                for url in urls
            ]

            # Await all tasks and process their results
            for response in await asyncio.gather(*tasks):
                parseWebpage(response)

def main():
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(get_data_asynchronous())
    loop.run_until_complete(future)

main()

Solution

  • The source of the page after the rendering method has executed is not stored in the response's content attribute, but in the raw_html attribute of the HTML object. In this case, the value returned should be r.html.raw_html.
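
    Applying that fix to the fetch coroutine from the question gives a minimal sketch of the change; only the return value differs, and the rest of the question's setup is assumed to stay the same:

    async def fetch(session, url):
        r = await session.get(url)
        # Execute the page's JavaScript in the headless browser
        await r.html.arender()
        # Return the markup produced after rendering, not the original
        # response body (r.content), which never reflects the rendering
        return r.html.raw_html

    Since raw_html holds the rendered document, parseWebpage now receives the table values that the JavaScript inserts instead of the empty markup from the initial response.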