Tags: python, amazon-ec2, web-scraping, nohup, xvfb

Running a Selenium web scraper with nohup on an Ubuntu EC2 instance


I have a web scraper that uses Selenium, and I want it to keep running in the background on my Ubuntu EC2 instance even after I log out, so I am trying to use nohup. My current code is:

webscrape.py:

from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC 

def main():

    display = Display(visible=0, size=(800, 600))
    display.start()  # start the virtual display

    driver = webdriver.Firefox()

    # ... do the web scraping ...

    driver.close()
    display.stop()

if __name__ == "__main__": main()

When I am logged in to my EC2 instance and run python webscrape.py, it runs normally. However, when I run nohup python webscrape.py and log out, it stops. The nohup.out log shows the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/Cruz/Scripts/WebScrape/google_brand_web_scraper.py", line 175, in <module>
    if __name__ == "__main__": main()
  File "/usr/local/lib/python2.7/dist-packages/Cruz/Scripts/WebScrape/google_brand_web_scraper.py", line 120, in main
    website = GoogleBrandWebsiteScraper().brand_url_pull_from_google(i,driver) # get website for a brand
  File "/usr/local/lib/python2.7/dist-packages/Cruz/Scripts/WebScrape/google_brand_web_scraper.py", line 34, in brand_url_pull_from_google
    s = BeautifulSoup(driver.page_source)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 436, in page_source
    return self.execute(Command.GET_PAGE_SOURCE)['value']
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 171, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
    return self._request(command_info[0], url, body=data)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 379, in _request
    self._conn.request(method, parsed_url.path, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 973, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1007, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 829, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 791, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 772, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
socket.error: [Errno 111] Connection refused

So, apparently, my logging out breaks something. Any hints would be appreciated.


Solution

  • You may want to try screen. I'm not familiar enough with nohup to work out what is going wrong there, but screen should work.

    1. Run screen to create a new terminal session to work in.
    2. Run your code in it.
    3. Press Ctrl + a followed by d to detach from that terminal (it keeps running in the background).
    4. Run screen -r to re-attach to that terminal.

    While you are detached from a terminal, you can disconnect from the system and the detached terminal keeps running. So you can log out between steps 3 and 4.
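
The steps above can also be scripted so that the scraper starts in a session that is already detached. This is only a minimal sketch, not a tested deployment: it assumes screen is installed on the instance, and the session name "scraper" and the helper names build_screen_command and start_detached are made up for illustration. It relies on screen's -d -m options (start a session in detached mode) and -S (give the session a name).

```python
import subprocess


def build_screen_command(session_name, script_path, python_bin="python"):
    """Return the argv that starts script_path inside a detached screen session."""
    # -d -m : create the session but do not attach to it
    # -S    : name the session so it can be re-attached by name later
    return ["screen", "-dmS", session_name, python_bin, script_path]


def start_detached(session_name, script_path):
    """Launch the script in a detached screen session (hypothetical helper)."""
    cmd = build_screen_command(session_name, script_path)
    # check_call raises CalledProcessError if screen exits with a non-zero status
    subprocess.check_call(cmd)
    return cmd


if __name__ == "__main__":
    # Only print the command here; call start_detached(...) to actually run it.
    print(" ".join(build_screen_command("scraper", "webscrape.py")))
```

After start_detached("scraper", "webscrape.py") you should be able to log out. Later, screen -ls lists the running sessions and screen -r scraper re-attaches to this one.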