
dryscrape install on Ubuntu Server 16.04


I'm having trouble getting dryscrape to work on an Ubuntu 16.04 server (clean install on DigitalOcean). The objective is to scrape websites whose content is populated by JavaScript.

I'm following dryscrape install instructions from here:

apt-get update
apt-get install qt5-default libqt5webkit5-dev build-essential \
                  python-lxml python-pip xvfb

pip install dryscrape

I then run the following Python script, which I found here along with the test HTML page at the same link. (The page's intro text differs depending on whether the JavaScript was executed.)

Python

import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
my_url = 'http://www.example.com/scrape.php'
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response)
soup.find(id="intro-text")

HTML - scrape.php

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>

When I run the script, I don't get the expected data back; I only get the errors shown under Output below.

Is there anything obvious that I'm missing?

Note: I've trawled numerous install guides/threads and can't seem to get it working. I've also attempted to use selenium but can't seem to get anywhere with it either. Many thanks.

Output

Traceback (most recent call last):
  File "js.py", line 3, in <module>
    session = dryscrape.Session()
  File "/usr/local/lib/python2.7/dist-packages/dryscrape/session.py", line 22, in __init__
    self.driver = driver or DefaultDriver()
  File "/usr/local/lib/python2.7/dist-packages/dryscrape/driver/webkit.py", line 30, in __init__
    super(Driver, self).__init__(**kw)
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 230, in __init__
    self.conn = connection or ServerConnection()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 507, in __init__
    self._sock = (server or get_default_server()).connect()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 450, in get_default_server
    _default_server = Server()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 424, in __init__
    raise NoX11Error("Could not connect to X server. "
webkit_server.NoX11Error: Could not connect to X server. Try calling dryscrape.start_xvfb() before creating a session.

Working Script

import dryscrape
from bs4 import BeautifulSoup

dryscrape.start_xvfb()
session = dryscrape.Session()
my_url = 'https://www.example.com/scrape.php'
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response, "html.parser")
print soup.find(id="intro-text").text

Solution

  • You have no X server running. The clue is in the last line of the traceback:

    Try calling dryscrape.start_xvfb() before creating a session

    See http://dryscrape.readthedocs.io/en/latest/usage.html

    import sys
    import dryscrape

    if 'linux' in sys.platform:
        # start xvfb in case no X is running. Make sure xvfb
        # is installed, otherwise this won't work!
        dryscrape.start_xvfb()
    

    http://dryscrape.readthedocs.io/en/latest/installation.html

    xvfb (necessary only if no other X server is available)

    So you can just add:

    dryscrape.start_xvfb()
    

    before:

    session = dryscrape.Session()
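
If the same script also has to run on a machine that already has a display, a guard along these lines might help. This is only a sketch: `needs_xvfb` and `make_session` are hypothetical helper names, and treating an unset or empty `DISPLAY` variable as "no X server" is my assumption, not something dryscrape checks for you.

```python
import os
import sys


def needs_xvfb(platform=None, environ=None):
    """Guess whether a virtual X server has to be started.

    Assumption: on Linux, an unset or empty DISPLAY means no X server.
    """
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ
    return "linux" in platform and not environ.get("DISPLAY")


def make_session():
    # Imported lazily so the check above can be used without dryscrape.
    import dryscrape

    if needs_xvfb():
        # Needs the xvfb package from the apt-get step in the question.
        dryscrape.start_xvfb()
    return dryscrape.Session()
```

With that in place, `session = make_session()` would replace the separate `dryscrape.start_xvfb()` and `dryscrape.Session()` calls in the working script. On a headless server such as the one in the question, `DISPLAY` is normally unset, so xvfb gets started before the session is created.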