Tags: python, selenium, selenium-webdriver, web-scraping, ghost.py

Scrape data with Ghost


I want to scrape some data from the following link:

http://www.six-structured-products.com/en/search-find/new-search#search_type=profi&class_category=svsp

My target is simply to retrieve the table of all instruments (displayed under "search results" on pages 1, 2, 3, etc.) into a DataFrame. I can't simply use urllib and urllib2 to retrieve static data, since I need to mimic a human by clicking on buttons: Ghost or Selenium is the way to go.

However, I really do not see how to translate "click on page 2", "click on page 3", and so on into code, nor how to get the total number of pages.

My code:

from ghost import Ghost

url = "http://www.six-structured-products.com/en/search-find/new-search#search_type=profi&class_category=svsp"

gh = Ghost()
page, resources = gh.open(url)

I am stuck there and do not know which identifier to put instead of XXX:

page, resources = gh.evaluate(
    "document.getElementById(XXX).click();", expect_loading=True)

(I would also accept a solution using Selenium)


Solution

  • You can also page through the results with the "next" button, like this (a Selenium alternative is sketched after the code):

    import logging
    import sys
    
    from ghost import Ghost, TimeoutError
    
    
    logging.basicConfig(level=logging.INFO)
    
    url = "http://www.six-structured-products.com/en/search-find/new-search#search_type=profi&class_category=svsp"
    
    # Headless WebKit browser; allow slow AJAX responses up to 20 seconds
    ghost = Ghost(wait_timeout=20, log_level=logging.CRITICAL)
    data = dict()  # collect the extracted rows here if you want to persist them
    
    
    def extract_value(line, ntd):
        # Text of the ntd-th data cell of one result row
        return line.findFirst('td.DataItem:nth-child(%d)' % ntd).toPlainText()
    
    
    def extract(ghost):
        # All rows of the "search results" table on the current page
        lines = ghost.main_frame.findAllElements(
            '.derivativeSearchResult > tbody:nth-child(2) tr'
        )
    
        for line in lines:
            symbol = extract_value(line, 2)
            name = extract_value(line, 5)
            logging.info("Found %s: %s" % (symbol, name))
            # Persist data here
    
        ghost.sleep(1)
    
        # Click the "next" link; on the last page the click no longer triggers
        # a page load, so the TimeoutError tells us we are done.
        try:
            ghost.click('.pagination_next a', expect_loading=True)
        except TimeoutError:
            sys.exit(0)
    
        extract(ghost)
    
    
    ghost.open(url)
    extract(ghost)
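
  • Since the question also accepts Selenium, here is a minimal equivalent sketch. It reuses the same CSS selectors as the Ghost answer above (.derivativeSearchResult, td.DataItem, .pagination_next a) and collects every row into a pandas DataFrame; the selectors, the driver choice and the stop condition are assumptions that may need adjusting against the live page. The iterative loop mirrors the recursive Ghost version while avoiding Python's recursion limit on long result lists:

    import pandas as pd
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    
    url = ("http://www.six-structured-products.com/en/search-find/"
           "new-search#search_type=profi&class_category=svsp")
    
    # Same row selector as in the Ghost answer above (an assumption)
    ROW_SELECTOR = '.derivativeSearchResult > tbody:nth-child(2) tr'
    
    driver = webdriver.Firefox()  # or webdriver.Chrome()
    driver.get(url)
    wait = WebDriverWait(driver, 20)
    
    rows = []
    while True:
        # Wait until the results table of the current page is rendered
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ROW_SELECTOR)))
    
        for line in driver.find_elements(By.CSS_SELECTOR, ROW_SELECTOR):
            cells = line.find_elements(By.CSS_SELECTOR, 'td.DataItem')
            rows.append([cell.text for cell in cells])
    
        # Stop when there is no "next" link left, otherwise go to the next page
        try:
            next_link = driver.find_element(By.CSS_SELECTOR, '.pagination_next a')
        except NoSuchElementException:
            break
        next_link.click()
        # An extra wait for the table to refresh may be needed here,
        # since the results are updated via AJAX.
    
    driver.quit()
    df = pd.DataFrame(rows)  # the table the question asks for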