python dynamic web-scraping ghost.py

Python: Using Ghost for dynamic webscraping


I am trying to get the weather data from: http://metservice.com/maps-radar/local-observations/local-3-hourly-observations

I found an example here on how to use Ghost for scraping dynamic content, but I have not worked out how to handle the result.

Since Ghost seems to have issues when run in an interactive shell, I use

print(result)

to pipe output to file:

python getMetObservation.py > proper_result

This is my python code:

from ghost import Ghost
url = 'http://metservice.com/maps-radar/local-observations/local-3-hourly-observations'
gh = Ghost(wait_timeout=60)
page, resources = gh.open(url)
result, resources = gh.evaluate("document.getElementsByClassName('obs-content');")
print(result)

When I examine the file, it does contain what I am after, but it also contains a huge amount of information I do not need. It is also not clear how to use the variable result that evaluate returns. Inspecting ghost.py, the call seems to be handled by

self.main_frame.evaluateJavaScript("%s" % script)

in:

def evaluate(self, script):
    """Evaluates script in page frame.

    :param script: The script to evaluate.
    """
    return (
        self.main_frame.evaluateJavaScript("%s" % script),
        self._release_last_resources(),
    )

When I execute the command:

document.getElementsByClassName('obs-content');

in a Chromium console I get the correct response.

I am a beginner when it comes to Python but willing to learn. Also note that I am running this in a Python virtual environment under Ubuntu, in case it matters.
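One possible explanation for the oversized output, offered as an assumption rather than something verified against this Ghost version: the Chromium console shows the live HTMLCollection, whereas evaluateJavaScript has to convert the returned DOM objects into Python values. A minimal sketch of one way around that is to have the script return a single JSON string holding only the text of the matching elements. The helper names build_extract_script, parse_result, and fetch_texts are hypothetical. The sketch assumes the same Ghost API as the code above, that this Ghost version provides wait_for_selector, and that evaluate hands the JSON back as a plain string.

```python
import json


def build_extract_script(class_name):
    """Return JavaScript that yields the matching elements' text as one JSON string."""
    # json.dumps quotes the class name so it is a valid JavaScript string literal.
    return (
        "JSON.stringify(Array.prototype.map.call("
        "document.getElementsByClassName(%s), "
        "function (el) { return el.textContent; }));" % json.dumps(class_name)
    )


def parse_result(result):
    """Decode the JSON string returned by evaluate() into a list of strings."""
    return json.loads(result) if result else []


def fetch_texts(url, class_name):
    """Open the page with Ghost and return the text of the matching elements."""
    # Imported here so the two helpers above can be used without Ghost installed.
    from ghost import Ghost

    gh = Ghost(wait_timeout=60)
    gh.open(url)
    # Assumption: this Ghost.py version provides wait_for_selector().
    gh.wait_for_selector("." + class_name)
    # Assumption: the JSON produced by the script comes back as a plain string.
    result, _resources = gh.evaluate(build_extract_script(class_name))
    return parse_result(result)
```

Calling fetch_texts(url, 'obs-content') with the url from the question would then give a plain list of strings to loop over, instead of the full DOM dump.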


Solution

  • Note: I am posting this as an answer because my current solution is to use the iMacros extension to save the webpage locally and then scrape the now-static data with BeautifulSoup.

    The original question was about how to use Ghost on the dynamic page, but since I did not get that far, I found another solution that may be of use to others.

    The iMacro content (which I named GetWeather.iim):

    VERSION BUILD=8881205 RECORDER=FX
    TAB T=1
    URL GOTO=http://www.metservice.com/maps-radar/local-observations/local-3-hourly-observations
    WAIT SECONDS=5
    SAVEAS TYPE=CPL FOLDER=* FILE=+_{{!NOW:yyyymmdd_hhnnss}}

    Shell script called from crontab:

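    #!/bin/bash
    # cron jobs have no display set, so point Firefox at the local X display
    export DISPLAY=:0.0
    /usr/bin/firefox &
    # give Firefox time to start before running the macro
    sleep 5
    /usr/bin/firefox imacros://run/?m=GetWeather.iim
    # give the macro time to load and save the page
    sleep 10
    # close the window gracefully so Firefox does not come up in safe mode next time
    wmctrl -c "Mozilla Firefox"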

    together with a Python script that does the actual web scraping using BeautifulSoup.

    Updated with the proper way of stopping Firefox, so that it does not revert to safe mode, as instructed in the first answer to the thread.
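
The scraping script itself is not shown above. As a minimal sketch of what that step could look like, assuming the saved page keeps the observations in elements with the class obs-content (the class name used in the question), something along these lines should work. The function name extract_observations and the sample markup are hypothetical and only stand in for a page saved by the macro.

```python
from bs4 import BeautifulSoup


def extract_observations(html, class_name="obs-content"):
    """Return the text of every element carrying the given CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        element.get_text(" ", strip=True)
        for element in soup.find_all(class_=class_name)
    ]


# Hypothetical stand-in for a page saved by the iMacros script.
sample_html = """
<div class="obs-content"><h3>Station A</h3><p>Temp: 12</p></div>
<div class="other">ignored</div>
<div class="obs-content"><h3>Station B</h3><p>Temp: 9</p></div>
"""

print(extract_observations(sample_html))
```

For a real run, the contents of the file written by SAVEAS would be read from disk and passed to extract_observations in place of sample_html. The element structure inside each block will differ on the real page, so the extraction would need adjusting to it.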