python python-requests spyder dynamic-html

Python in Spyder with requests-html and asynchronous 'render' is a nightmare to figure out


Starting point is Spyder IDE.

>Spyder IDE (5.1.0)
>
>The Scientific Python Development Environment | Spyder-IDE.org 
>
>Python 3.8.5 64-bit | Qt 5.12.9 | PyQt5 5.12.3 | Linux 5.4.0-81-generic

What do I want to do? Scrape a tricky blog (it seems that Blogspot is obfuscating a lot more). But within Spyder, I sometimes find that I cannot even scrape my own home page...

import asyncio
from requests_html import AsyncHTMLSession, HTML, HTMLSession
from bs4 import BeautifulSoup as bs
import re
import os, os.path
from pathlib2 import Path
from collections import OrderedDict as Odict
from datetime import datetime, date, timedelta
import pytz
import unicodedata
import sys

# asession = AsyncHTMLSession()
ass = AsyncHTMLSession()
sss = HTMLSession()

url='http://localhost/index.html'

def syncurl(session=None, url=None):
    r = session.get(url)
    return r

async def asyncurl(session=None, url=None):
    r = await session.get(url)
    #if r.status_code == 200:
        #await r.html.arender()
    return r
    
def gurl(ass, url):
    fiz = lambda : asyncurl(ass, url)
    foz = ass.run(fiz)
    return foz

So if I run this file in Spyder and then call gurl from the console, I get the expected 'loop already running' crap.

gurl(ass,url)
Traceback (most recent call last):

  File "<ipython-input-2-ebc91fe79d44>", line 1, in <module>
    gurl(ass,url)

  File "/home/user/PycharmProjects/blogscrape/BlogScraping/asynctest.py", line 38, in gurl
    foz = ass.run(fiz)

  File "/opt/anaconda3/lib/python3.8/site-packages/requests_html.py", line 774, in run
    done, _ = self.loop.run_until_complete(asyncio.wait(tasks))

  File "/opt/anaconda3/lib/python3.8/asyncio/base_events.py", line 592, in run_until_complete
    self._check_running()

  File "/opt/anaconda3/lib/python3.8/asyncio/base_events.py", line 552, in _check_running
    raise RuntimeError(f'This event loop is already running : {self._thread_id}')

RuntimeError: This event loop is already running : 139750638774080

I'm not trying to reinvent the wheel here, and I'm sure many others have this issue, but so far I've not seen a concise answer (other than "it's a Spyder bug", etc.). I just want it to work in Spyder, principally because I like to play around with pandas to look at the results. I suppose one way would be to run the thing as a standalone script, save the results into a pickle, and THEN use Spyder to reload the dataframe and work with that. But, hey, why is that necessary?

The principal problem is the lack of clarity in requests-html. The error is very opaque to anyone who is simply trying to work around the original problem of:

RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.

And yes, I have tried to Google this problem, but the answers always start talking 'asyncio' stuff. I'm reading the 'requests-html' help; anything beyond that is above my pay-grade (currently zero).

So any advice, please? (only simple stuff from asyncio that a simple IC designer could understand).


Solution

  • Thanks @Daniel. Yes, that does seem to work and fixes the issue shown above. It is not 100% perfect, though: I sometimes get a timeout error and I'm not sure why, but I no longer get the 'event loop is already running' error.

    Just to put it all in one place: install the package with

    pip install nest_asyncio
    

    Then add the following to the Python code.

    import nest_asyncio
    nest_asyncio.apply()
    

    This is enough to get the code running within Spyder, which was the original issue.
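For anyone wondering why the patch is needed, here is a minimal sketch that reproduces the same RuntimeError using only the standard library, with no Spyder or requests-html involved. It assumes that Spyder's IPython console already has an asyncio event loop running (which is what the traceback above suggests), so code typed there is in the same position as the `outer` coroutine below. `fetch_stub` and `outer` are hypothetical names used only for this illustration.

```python
import asyncio


async def fetch_stub():
    # Hypothetical stand-in for `await session.get(url)`; no network access.
    await asyncio.sleep(0)
    return "ok"


async def outer():
    # Inside this coroutine a loop is already running, which is assumed to be
    # the situation of any code executed from Spyder's console.
    loop = asyncio.get_running_loop()
    coro = fetch_stub()
    try:
        # This is the kind of call AsyncHTMLSession.run() makes in the
        # traceback above; a plain asyncio loop refuses to be re-entered.
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # tidy up the coroutine that never got to run
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

nest_asyncio.apply() patches asyncio so that this kind of nested run_until_complete call is allowed, which is why gurl(ass, url) stops raising once the patch is in place.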

    Adding an extra sleep and an explicit render timeout to 'asyncurl' allows the script to run, albeit slowly, so don't try to make too many calls in one script. The function above is modified as follows.

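    async def asyncurl(session=None, url=None):
        r = await session.get(url)
        # Pause before rendering; slow, but it got rid of the intermittent timeouts.
        await asyncio.sleep(5.0)
        # if r.status_code == 200:
        # Render the page's JavaScript with a generous timeout.
        await r.html.arender(timeout=20000)
        return r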
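
Since the timeouts are intermittent, a bounded retry might be an alternative to one long fixed sleep. The following is only a sketch under assumptions, using nothing but asyncio: `with_retries` and `flaky_fetch` are hypothetical names, not part of requests-html, and because the post does not say which exception arender raises on a timeout, only the timeout raised by asyncio.wait_for is handled here.

```python
import asyncio


async def with_retries(make_coro, attempts=3, timeout=20.0, pause=5.0):
    """Await make_coro() up to `attempts` times, each try limited to `timeout` seconds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # A fresh coroutine is created for every attempt.
            return await asyncio.wait_for(make_coro(), timeout=timeout)
        except asyncio.TimeoutError as exc:
            last_error = exc
            if attempt < attempts:
                # Back off briefly before trying again.
                await asyncio.sleep(pause)
    raise last_error


# Demonstration with a stub that is slow on the first call only.
calls = {"n": 0}


async def flaky_fetch():
    calls["n"] += 1
    if calls["n"] == 1:
        await asyncio.sleep(1.0)  # longer than the timeout used below
    return "page"


result = asyncio.run(with_retries(flaky_fetch, attempts=3, timeout=0.05, pause=0.01))
print(result, calls["n"])
```

In asyncurl, the render step could then be wrapped as `await with_retries(lambda: r.html.arender())`. This has not been tested against a real site, so treat it as a starting point only.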