I have a web scraper that uses selenium, which I want to run on my Ubuntu EC2 instance in the background even after I log out, so I am trying to use nohup. The current code I have is:
webscrape.py:
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC


def main():
    display = Display(visible=0, size=(800, 600))
    display.start()  # starts the virtual display
    driver = webdriver.Firefox()
    # ...do the webscraping...
    driver.close()
    display.stop()


if __name__ == "__main__":
    main()
When I am logged in to my EC2 instance and run python webscrape.py it runs normally. However, when I run nohup python webscrape.py and log out, it stops. In the nohup.out log I get the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/Cruz/Scripts/WebScrape/google_brand_web_scraper.py", line 175, in <module>
    if __name__ == "__main__": main()
  File "/usr/local/lib/python2.7/dist-packages/Cruz/Scripts/WebScrape/google_brand_web_scraper.py", line 120, in main
    website = GoogleBrandWebsiteScraper().brand_url_pull_from_google(i,driver) # get website for a brand
  File "/usr/local/lib/python2.7/dist-packages/Cruz/Scripts/WebScrape/google_brand_web_scraper.py", line 34, in brand_url_pull_from_google
    s = BeautifulSoup(driver.page_source)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 436, in page_source
    return self.execute(Command.GET_PAGE_SOURCE)['value']
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 171, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
    return self._request(command_info[0], url, body=data)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 379, in _request
    self._conn.request(method, parsed_url.path, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 973, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1007, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 829, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 791, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 772, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
socket.error: [Errno 111] Connection refused
So, apparently me logging out messes something up. Any hints appreciated.
You may want to try screen. I'm not familiar enough with nohup to figure out the issue you're having there, but screen should work.
Run screen to create a new terminal that you can do your work in.
screen -r will re-attach to that terminal.
When you're "detached" from a terminal, you can disconnect from the system and the detached terminal will continue running. So you can disconnect any time after detaching and before re-attaching with screen -r.
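As a rough sketch of the screen workflow described above, the commands can also be built and launched from Python. The helper names and the session name "scraper" are hypothetical, chosen only for illustration; the sketch assumes screen is installed and that webscrape.py is in the current directory. The -d -m flags ask screen to start the session already detached, and -S gives it a name to re-attach to later.

```python
import subprocess


def screen_start_command(session, script):
    """Build the argv that runs `script` inside a new, detached screen session."""
    # -dmS: start detached (-d -m) and name the session (-S)
    return ["screen", "-dmS", session, "python", script]


def screen_reattach_command(session):
    """Build the argv that re-attaches to the named screen session."""
    return ["screen", "-r", session]


def start_in_screen(session, script):
    """Launch the script in a detached session; returns screen's exit code."""
    # Assumes the `screen` binary is on PATH
    return subprocess.call(screen_start_command(session, script))


if __name__ == "__main__":
    # Only print the commands here; call start_in_screen() to actually launch.
    print(" ".join(screen_start_command("scraper", "webscrape.py")))
    print(" ".join(screen_reattach_command("scraper")))
```

Once the session is started this way, you can log out; after logging back in, screen -r scraper should bring the session back, provided the script is still running. Inside an attached session, Ctrl-A followed by D detaches again.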