python, python-asyncio, concurrent.futures, python-requests-html

Concurrent.futures + requests_html's render() = "There is no current event loop in thread 'ThreadPoolExecutor-0_0'."


I am using Requests HTML to render javascript on a page. I am also using concurrent.futures to speed up the process. My code was working perfectly until I added the following line:

response.html.render(timeout=60, sleep=1, wait=3, retries=10)

upon which I got the error:

    response.html.render(timeout=60, sleep=1, wait=3, retries=10)
  File "C:\Users\Ze\Anaconda3\lib\site-packages\requests_html.py", line 586, in render
    self.browser = self.session.browser # Automatically create a event loop and browser
  File "C:\Users\Ze\Anaconda3\lib\site-packages\requests_html.py", line 727, in browser
    self.loop = asyncio.get_event_loop()
  File "C:\Users\Ze\Anaconda3\lib\asyncio\events.py", line 639, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-0_0'.

If I move the problematic line into the section below, it works again, but then the rendering is not happening in parallel, right?

for result in concurrent.futures.as_completed(futures):
    result = result.result()

What is causing the problem? I've never used asyncio. Do I have to use it for this? Is it easy to implement?

Thank you very much!

CODE:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import concurrent.futures

session = HTMLSession()


def load_page_and_extract_items(url):
    response = session.get(url, headers=get_headers())

    # render javascript
    response.html.render(timeout=60, wait=3)
    source = BeautifulSoup(response.html.raw_html, 'lxml')


def get_pages(remaining_urls):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # for each of 60 possible pages
        for current_page_number in range(60):
            futures = [executor.submit(load_page_and_extract_items, url) for url in remaining_urls]

            for result in concurrent.futures.as_completed(futures):
                result = result.result()

def main():
    get_pages(urls)

Solution

  • This doesn't directly answer the question, but it demonstrates a technique for multithreaded web scraping that performs well in my tests. It uses the URL stated in the original question and searches for certain tags that may contain HREFs, then processes those URLs. The general idea is to create a pool of sessions: each thread gets a session object from the pool (a queue), uses it, and then puts it back on the queue, making it available to other threads.

    from requests_html import HTMLSession
    import concurrent.futures
    import queue
    
    # pool of HTMLSession objects shared by all worker threads
    QUEUE = queue.Queue()
    
    
    def makeSessions(n=4):
        # pre-create a pool of n sessions for the worker threads to share
        for _ in range(n):
            QUEUE.put(HTMLSession())
    
    
    def cleanup():
        # drain the pool and close every remaining session
        while True:
            try:
                getSession(False).close()
            except queue.Empty:
                break
    
    
    def getSession(block=True):
        # take a session from the pool; waits for a free one unless block=False
        return QUEUE.get(block=block)
    
    
    def freeSession(session):
        # return a session to the pool so other threads can reuse it
        if isinstance(session, HTMLSession):
            QUEUE.put(session)
    
    
    def getURLs():
        # scrape the landing page and collect the category links to process
        urls = []
        session = getSession()
        try:
            response = session.get('https://www.aliexpress.com')
            response.raise_for_status()
            response.html.render()
            for a in response.html.xpath('//dt[@class="cate-name"]/span/a'):
                if 'href' in a.attrs:
                    urls.append(a.attrs['href'])
        finally:
            freeSession(session)
        return urls
    
    
    def processURL(url):
        # fetch and render a single URL using a session borrowed from the pool
        print(url)
        session = getSession()
        try:
            response = session.get(url)
            response.raise_for_status()
            response.html.render()
        finally:
            freeSession(session)
    
    
    if __name__ == '__main__':
        try:
            makeSessions()
            with concurrent.futures.ThreadPoolExecutor() as executor:
                futures = [executor.submit(processURL, url) for url in getURLs()]
                for _ in concurrent.futures.as_completed(futures):
                    pass
        finally:
            cleanup()
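
  • As for what causes the error itself: the traceback shows that requests_html calls asyncio.get_event_loop() when it lazily creates its browser, and worker threads started by ThreadPoolExecutor have no event loop by default. A commonly suggested workaround is to give each worker thread its own event loop (and its own session) before the first render() call. The sketch below is untested with render(); the init_worker/fetch helpers and the use of the executor's initializer hook are my own additions, not part of the original question.

    import asyncio
    import threading
    import concurrent.futures
    from requests_html import HTMLSession

    thread_data = threading.local()


    def init_worker():
        # runs once in each worker thread: create an event loop so that
        # requests_html's asyncio.get_event_loop() call succeeds there
        asyncio.set_event_loop(asyncio.new_event_loop())
        # give each thread its own session, since a session caches the
        # loop/browser it creates and should not be shared across threads
        thread_data.session = HTMLSession()


    def fetch(url):
        response = thread_data.session.get(url)
        response.html.render(timeout=60, wait=3)
        return response.html.raw_html


    if __name__ == '__main__':
        urls = ['https://www.aliexpress.com']
        with concurrent.futures.ThreadPoolExecutor(initializer=init_worker) as executor:
            futures = [executor.submit(fetch, url) for url in urls]
            for future in concurrent.futures.as_completed(futures):
                print(len(future.result()))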