
Trouble getting the trade-price using "Requests-HTML" library


I've written a script in Python to get the price of the last trade from a JavaScript-rendered webpage. I can get the content if I choose to go with Selenium. My goal here is not to use any browser simulator like Selenium, because the latest release of Requests-HTML is supposed to be able to render JavaScript-generated content. However, I haven't been able to make it work. When I run the script, I get the following error. Any help on this will be highly appreciated.

Site address : webpage_link

The script I've tried with:

import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.gdax.com/trade/LTC-EUR')
    js = r.html.render()
    item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
    print(item)

This is the complete traceback:

Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
    self._timeout)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
    raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\experiment.py", line 6, in <module>
    item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
AttributeError: 'NoneType' object has no attribute 'find'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\shutil.py", line 387, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\ar\\.pyppeteer\\.dev_profile\\tmp1gng46sw\\CrashpadMetrics-active.pma'

The price I'm after is available on the top of the page which can be visible like this 177.59 EUR Last trade price. I wish to get 177.59 or whatever the current price is.


Solution

  • You have several errors. The first is a 'navigation' timeout, showing that the page didn’t complete rendering:

    Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
    handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
    Traceback (most recent call last):
      File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
        self._callback(*self._args)
      File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
        self._timeout)
      File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
        raise error
    concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded
    

    This traceback is not raised in the main thread, so your code was not aborted because of it. Your page may or may not be complete; you may want to set a longer timeout or introduce a sleep cycle to give the browser time to process AJAX responses.

    Next, response.html.render() returns None. It loads the HTML into a headless Chromium browser, leaves JavaScript rendering to that browser, then copies the rendered page HTML back into the response.html data structure in place, so nothing needs to be returned. As a result, js is set to None, not a new HTML instance, which causes your next traceback.

    Use the existing response.html object to search, after rendering:

    r.html.render()
    item = r.html.find('.MarketInfo_market-num_1lAXs', first=True)
    

    There is most likely no such CSS class, because the last 5 characters are generated on each page render, after JSON data is loaded over AJAX. This makes it hard to use CSS to find the element in question.
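    As a workaround, you can match only the stable part of the class name instead of hard-coding the generated suffix. This is a minimal sketch against assumed sample markup (not the live page), using a regular expression over the raw HTML:

    ```python
    import re

    # Assumed sample markup; the 5-character class suffix changes per build,
    # so only the 'MarketInfo_market-num_' prefix is stable.
    html = ('<span class="MarketInfo_market-num_1lAXs">177.59 EUR</span>'
            '<span class="MarketInfo_market-num_9zQbX">+1.01 %</span>')

    # Match any element whose class starts with the stable prefix and
    # capture its text content.
    pattern = re.compile(r'class="MarketInfo_market-num_\w+"[^>]*>([^<]+)<')
    values = pattern.findall(html)
    print(values)  # ['177.59 EUR', '+1.01 %']
    ```

    A regex is fragile against markup changes; the search_all() approach shown below does the same prefix matching against the parsed HTML, which is generally more robust.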

    Moreover, I found that without a sleep cycle, the browser has no time to fetch AJAX resources and render the information you wanted to load. Give it, say, 10 seconds of sleep to do some work before copying back the HTML. Set a longer timeout (the default is 8 seconds) if you see network timeouts:

    r.html.render(timeout=10, sleep=10)
    

    You could also set timeout=0 to disable the timeout and wait indefinitely until the page has loaded.

    Hopefully a future API update also provides features to wait for network activity to cease.

    You can use the included parse library to find the matching CSS classes:

    # search for CSS suffixes
    suffixes = [r[0] for r in r.html.search_all('MarketInfo_market-num_{:w}')]
    for suffix in suffixes:
        # for each suffix, find all matching elements with that class
        items = r.html.find('.MarketInfo_market-num_{}'.format(suffix))
        for item in items:
            print(item.text)
    

    Now we get this output:

    169.81 EUR
    +
    1.01 %
    18,420 LTC
    169.81 EUR
    +
    1.01 %
    18,420 LTC
    169.81 EUR
    +
    1.01 %
    18,420 LTC
    169.81 EUR
    +
    1.01 %
    18,420 LTC
    

    Your last traceback shows that the Chromium user data path could not be cleaned up. The underlying Pyppeteer library configures the headless Chromium browser with a temporary user data path, and in your case the directory contains a still-locked resource. You can ignore the error, although you may want to try to remove any remaining files in the .pyppeteer folder at a later time.
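    If the leftover profile directories bother you, here is a small cleanup sketch. The .pyppeteer/.dev_profile path matches the one in your traceback; adjust it if your setup differs:

    ```python
    import shutil
    from pathlib import Path

    def clean_pyppeteer_profiles(root=Path.home() / '.pyppeteer' / '.dev_profile'):
        """Remove leftover temporary Chromium profile directories.

        Still-locked files are skipped silently via ignore_errors, so this
        is safe to run while a browser instance may still be shutting down.
        Returns the number of directories processed.
        """
        removed = 0
        for tmpdir in root.glob('tmp*'):
            shutil.rmtree(tmpdir, ignore_errors=True)  # skip locked files
            removed += 1
        return removed

    clean_pyppeteer_profiles()
    ```

    Running this after your scraping script exits keeps the profile directory from accumulating stale temp folders.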