Tags: python, proxy, tor, socks, python-requests-html

Python requests_html: Socks5h proxy does not work when calling "render()"


I'm using requests_html because I want the rendered HTML source. I also want to fetch it through a socks5h (Tor) proxy.

I wrote the code below. However, once render() is called, my real IP address is displayed, so render() does not seem to use the proxy settings.

When I tried to open the BBC News onion site with the same code, it failed, because the rendering request does not go through the Tor network.

Is there a way to render through a socks5h proxy?

from requests_html import HTMLSession

url = "http://ifconfig.me/ip"
# url = "https://www.bbcnewsv2vjtpsuy.onion/" # bbc news
session = HTMLSession()

proxies = {"http": "socks5h://localhost:9150","https": "socks5h://localhost:9150"}
r = session.get(url, proxies=proxies)
content = r.html
print(content.text) # Tor’s IP will be displayed 

content.render()    # rendering for javascript, etc..
print(content.text) # Raw IP will be displayed

Error message when trying to access the BBC News onion site:

Traceback (most recent call last):
  File "requests_html_01.py", line 12, in <module>
    content.render() # rendering for javascript
  File "/home/testuser/.local/lib/python3.6/site-packages/requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/home/testuser/.local/lib/python3.6/site-packages/requests_html.py", line 512, in _async_render
    await page.goto(url, options={'timeout': int(timeout * 1000)})
  File "/home/testuser/.local/lib/python3.6/site-packages/pyppeteer/page.py", line 879, in goto
    raise PageError(result)
pyppeteer.errors.PageError: net::ERR_INTERNET_DISCONNECTED at https://www.bbcnewsv2vjtpsuy.onion/


Solution

  • Sorry for the self-answer. requests_html uses pyppeteer internally for render(), and this proxy issue comes from that layer. The current requests_html does not seem to pass the proxy settings on, so the browser launched by pyppeteer does not use the proxy. According to the related GitHub pages, this is expected to be fixed in a future release.
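
A possible workaround, sketched under some assumptions: newer requests_html releases accept a browser_args list on HTMLSession, which is handed to the Chromium instance that pyppeteer launches, and Chromium has a --proxy-server flag. The helper names chromium_proxy_args and render_via_proxy below are hypothetical, not part of any library. The sketch also assumes Chromium expects the socks5 scheme rather than socks5h (its socks5 mode is documented as resolving hostnames on the proxy side). The older version shown in the traceback may not support browser_args at all.

```python
from urllib.parse import urlsplit


def chromium_proxy_args(proxy_url):
    """Hypothetical helper: turn a requests-style proxy URL into Chromium flags."""
    parts = urlsplit(proxy_url)
    # Assumption: Chromium does not recognise "socks5h"; map it to "socks5",
    # which is expected to resolve hostnames through the proxy.
    scheme = "socks5" if parts.scheme == "socks5h" else parts.scheme
    return [
        "--no-sandbox",  # keeps the flag requests_html passes by default
        f"--proxy-server={scheme}://{parts.hostname}:{parts.port}",
    ]


def render_via_proxy(url, proxy_url="socks5h://localhost:9150"):
    """Hypothetical helper: fetch and render a page with both layers proxied."""
    # Imported here so chromium_proxy_args works without requests_html installed.
    from requests_html import HTMLSession

    # Assumption: this requests_html version supports the browser_args parameter.
    session = HTMLSession(browser_args=chromium_proxy_args(proxy_url))
    proxies = {"http": proxy_url, "https": proxy_url}
    r = session.get(url, proxies=proxies)  # plain HTTP request via requests
    r.html.render()                        # rendering via the proxied browser
    return r.html.text
```

If these assumptions hold, calling render_via_proxy("http://ifconfig.me/ip") with Tor Browser's SOCKS port open should return the Tor exit IP both before and after rendering. The proxies argument is still needed because session.get() and render() use separate network stacks.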