
Parsing HTML from a JavaScript-rendered URL with a Python object


I would like to extract the market information from the following url and all of its subsequent pages:

https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1

I have successfully parsed the data that I want from the first page using some code from the following url:

https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages

I have also been able to parse out the URL for the next page to feed into a loop, in order to grab data from the next page. The problem is that it crashes before the next page loads, for a reason I don't fully understand.

I have a hunch that the class I have borrowed from 'impythonist' may be causing the problem. I don't know enough object-oriented programming to work out what is wrong. Here is my code, much of which is borrowed from the URL above:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html
import re
from bs4 import BeautifulSoup

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  



base_url='https://uk.reuters.com'
complete_next_page='https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1'

# LOOP TO RENDER PAGES AND GRAB DATA
while complete_next_page != '':
    print('NEXT PAGE: ', complete_next_page, '\n')
    r = Render(complete_next_page)  # USE THE CLASS TO RENDER JAVASCRIPT FROM PAGE
    result = r.frame.toHtml()       # ERROR IS THROWN HERE ON 2nd PAGE

    # PARSE THE HTML
    soup = BeautifulSoup(result, 'lxml')
    row_data = soup.find('div', attrs={'class': 'column1 gridPanel grid8'})
    print(len(row_data))

    # PARSE ALL ROW DATA
    stripe_rows = row_data.findAll('tr', attrs={'class': 'stripe'})
    non_stripe_rows = row_data.findAll('tr', attrs={'class': ''})
    print(len(stripe_rows))
    print(len(non_stripe_rows))

    # PARSE SPECIFIC ROW DATA FROM INDEX COMPONENTS
    # non_stripe_rows: from 4 to 18 (inclusive) contain data
    # stripe_rows: from 2 to 16 (inclusive) contain data
    i = 2
    while i < len(stripe_rows):
        print('CURRENT LINE IS: ', str(i))
        print(stripe_rows[i])
        print('###############################################')
        print(non_stripe_rows[i + 2])
        print('\n')
        i += 1

    # GETS LINK TO NEXT PAGE (WORKS)
    next_page = str(soup.find('div', attrs={'class': 'pageNavigation'}).find('li', attrs={'class': 'next'}).find('a')['href'])
    complete_next_page = base_url + next_page

I have annotated the bits of code that I wrote and understand, but I don't know enough about what is going on in the 'Render' class to diagnose the error. Unless it's something else?

Here is the error:

result = r.frame.toHtml()
AttributeError: 'Render' object has no attribute 'frame'

I don't need to keep the information in the class once I have parsed it out, so I was thinking it could perhaps be cleared or reset somehow and then updated to hold the URL information for pages 2 to n, but I have no idea how to do this.

Alternatively, if anyone knows another way to grab this specific data from this page and the following ones, that would be equally helpful.

Many thanks in advance.


Solution

  • How about using Selenium and PhantomJS instead of PyQt? (The AttributeError most likely arises because Qt allows only one QApplication per process: after the first page the event loop cannot be restarted cleanly, so _loadFinished, which is what sets self.frame, never runs for the second Render instance.)
    You can install Selenium by executing "pip install selenium". On a Mac you can install PhantomJS with "brew install phantomjs"; on Windows use choco instead of brew, and on Ubuntu use apt-get.

    from selenium import webdriver
    from bs4 import BeautifulSoup
    
    base_url = "https://uk.reuters.com"
    first_page = "/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1"
    
    # NOTE: PhantomJS support was removed in newer versions of Selenium;
    # with those, use headless Chrome or Firefox instead.
    browser = webdriver.PhantomJS()
    
    # PARSE THE HTML
    browser.get(base_url + first_page)
    soup = BeautifulSoup(browser.page_source, "lxml")
    row_data = soup.find('div', attrs={'class': 'column1 gridPanel grid8'})
    
    # PARSE ALL ROW DATA
    stripe_rows = row_data.findAll('tr', attrs={'class': 'stripe'})
    non_stripe_rows = row_data.findAll('tr', attrs={'class': ''})
    print(len(stripe_rows), len(non_stripe_rows))
    
    # GO TO THE NEXT PAGE
    next_button = soup.find("li", attrs={"class": "next"})
    while next_button:
        next_page = next_button.find("a")["href"]
        browser.get(base_url + next_page)
        soup = BeautifulSoup(browser.page_source, "lxml")
        row_data = soup.find('div', attrs={'class': 'column1 gridPanel grid8'})
        stripe_rows = row_data.findAll('tr', attrs={'class': 'stripe'})
        non_stripe_rows = row_data.findAll('tr', attrs={'class': ''})
        print(len(stripe_rows), len(non_stripe_rows))
        next_button = soup.find("li", attrs={"class": "next"})
    
    # DON'T FORGET THIS!!
    browser.quit()
    

    I know the code above is not efficient (it feels too slow), but I think it will bring you the results you want. In addition, if the web page you want to scrape does not use JavaScript, then even PhantomJS and Selenium are unnecessary: the requests module is enough. However, since I wanted to show the contrast with PyQt, I used PhantomJS and Selenium in this answer.
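Since the row classes and the "next" link are plain HTML once the page has rendered, the parsing half of either approach can be exercised offline. The sketch below runs BeautifulSoup against a made-up fragment that mimics the markup described in the question (the class names are taken from the question; the table data and URLs are invented), showing the data-row extraction and a more defensive way to build the next-page URL than plain string concatenation:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Invented fragment mimicking the Reuters markup described in the question
sample_html = """
<div class="column1 gridPanel grid8">
  <table>
    <tr class=""><th>Company</th><th>Price</th></tr>
    <tr class="stripe"><td>AAA</td><td>100</td></tr>
    <tr class=""><td>BBB</td><td>200</td></tr>
    <tr class="stripe"><td>CCC</td><td>300</td></tr>
  </table>
</div>
<div class="pageNavigation">
  <ul><li class="next"><a href="/investing/markets/index/.FTSE?pn=2">Next</a></li></ul>
</div>
"""

def parse_rows(soup):
    """Return the cell text of every data row inside the grid panel."""
    panel = soup.find('div', attrs={'class': 'column1 gridPanel grid8'})
    # Rows with <td> cells are data; the header row only has <th>
    return [[td.get_text() for td in row.find_all('td')]
            for row in panel.find_all('tr') if row.find('td')]

def find_next_url(soup, base_url):
    """Return the absolute URL of the next page, or None on the last page."""
    nav = soup.find('li', attrs={'class': 'next'})
    if nav is None or nav.find('a') is None:
        return None
    return urljoin(base_url, nav.find('a')['href'])

soup = BeautifulSoup(sample_html, 'html.parser')
print(parse_rows(soup))
print(find_next_url(soup, 'https://uk.reuters.com'))
```

The same two helpers would drop into the Selenium loop above in place of the inline findAll calls, with browser.page_source feeding the soup instead of a hard-coded string; returning None on the last page also gives the loop a clean termination condition.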