I'm having trouble getting dryscrape to work on an Ubuntu 16.04 server (a clean install on DigitalOcean). The goal is to scrape websites whose content is populated by JavaScript.
I'm following the dryscrape install instructions from here:
apt-get update
apt-get install qt5-default libqt5webkit5-dev build-essential \
python-lxml python-pip xvfb
pip install dryscrape
I then run the following Python script, which I found here along with the test HTML page at the same link. (The page returns HTML or JS.)
Python
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
my_url = 'http://www.example.com/scrape.php'
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response)
soup.find(id="intro-text")
HTML - scrape.php
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Javascript scraping test</title>
</head>
<body>
<p id='intro-text'>No javascript support</p>
<script>
document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
</script>
</body>
</html>
When I run the script, I don't get the expected output; I only get errors.
Is there anything obvious that I'm missing?
Note: I've trawled through numerous install guides and threads and can't get it working. I've also tried Selenium but couldn't get anywhere with it either. Many thanks.
Output
Traceback (most recent call last):
File "js.py", line 3, in <module>
session = dryscrape.Session()
File "/usr/local/lib/python2.7/dist-packages/dryscrape/session.py", line 22, in __init__
self.driver = driver or DefaultDriver()
File "/usr/local/lib/python2.7/dist-packages/dryscrape/driver/webkit.py", line 30, in __init__
super(Driver, self).__init__(**kw)
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 230, in __init__
self.conn = connection or ServerConnection()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 507, in __init__
self._sock = (server or get_default_server()).connect()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 450, in get_default_server
_default_server = Server()
File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 424, in __init__
raise NoX11Error("Could not connect to X server. "
webkit_server.NoX11Error: Could not connect to X server. Try calling dryscrape.start_xvfb() before creating a session.
Working Script
import dryscrape
from bs4 import BeautifulSoup
dryscrape.start_xvfb()
session = dryscrape.Session()
my_url = 'https://www.example.com/scrape.php'
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response, "html.parser")
print soup.find(id="intro-text").text
You have no X server running. The clue is in the last line of the error message:
Try calling dryscrape.start_xvfb() before creating a session
See http://dryscrape.readthedocs.io/en/latest/usage.html
if 'linux' in sys.platform:
    # start xvfb in case no X is running. Make sure xvfb
    # is installed, otherwise this won't work!
    dryscrape.start_xvfb()
The installation page, http://dryscrape.readthedocs.io/en/latest/installation.html, lists xvfb among the requirements:
xvfb (necessary only if no other X server is available)
So you can just add:
dryscrape.start_xvfb()
before:
session = dryscrape.Session()
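If the same script also has to run on machines that already have an X server, the call can be guarded. The following is a minimal sketch, not part of dryscrape: `needs_virtual_display` and `make_session` are hypothetical helper names, and checking the `DISPLAY` environment variable is only a heuristic I am assuming here for "no X server is available".

```python
import os
import sys


def needs_virtual_display(platform=None, environ=None):
    """Heuristic: True on Linux when no DISPLAY is set (assumption, see above)."""
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ
    # sys.platform is 'linux2' on Python 2 and 'linux' on Python 3.
    return platform.startswith("linux") and not environ.get("DISPLAY")


def make_session():
    """Start xvfb only when it appears to be needed, then open a session."""
    # Imported lazily so the helper above works without dryscrape installed.
    import dryscrape

    if needs_virtual_display():
        # Requires the xvfb package from the install step.
        dryscrape.start_xvfb()
    return dryscrape.Session()
```

With this in place, `session = make_session()` would replace the two lines above. On a headless server like the one in the question, the check returns True and xvfb is started before the session is created.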