Tags: python, selenium, web-scraping, beautifulsoup, dryscrape

How do I scrape websites which don't return the source code using Python?


I am trying to scrape the 'ASX code' for announcements made by companies on the Australian Stock Exchange from the following website: http://www.asx.com.au/asx/statistics/todayAnns.do

So far I have tried using BeautifulSoup with the following code:

import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.asx.com.au/asx/statistics/todayAnns.do')
parser = BeautifulSoup(response.content, 'html.parser')
print(parser)

However, when I print this, the output is not the same as what I see when I manually open the page and view the page source. After some googling and searching on Stack Overflow, I believe this is because JavaScript running on the page builds the content after the initial response, so it is missing from the HTML that requests receives.

However, I am unsure how to get around this. Any help would be greatly appreciated.

Thanks in advance.


Solution

  • Try this. As you have probably noticed, the content is loaded dynamically, so all the scraper needs to do is wait a few moments until the page has finished loading. Note that when you run the code below, you will get the left-hand header column of the table on that page (the ASX codes).

    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('http://www.asx.com.au/asx/statistics/todayAnns.do')
    time.sleep(8)  # give the page's JavaScript time to load the table

    # Parse the rendered DOM (the "lxml" parser requires the lxml package)
    soup = BeautifulSoup(driver.page_source, "lxml")
    for item in soup.select('.row'):
        print(item.text)
    driver.quit()
    

    Partial results:

    RLC
    RNE
    PFM
    PDF
    HXG
    NCZ
    NCZ
    

    Btw, I wrote and ran this code using Python 3.5, so there are no issues using the Selenium bindings with a recent version of Python.
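
As a possible refinement, the fixed time.sleep(8) could be swapped for an explicit wait, which returns as soon as the rows appear instead of always pausing for eight seconds. The following is only a minimal sketch, assuming the codes still sit in elements matching the `.row` selector used above. The helper names `fetch_rendered_html` and `unique_codes` are made up for illustration and are not part of Selenium or BeautifulSoup. The second helper also drops repeats such as the doubled NCZ in the partial results.

```python
def fetch_rendered_html(url, css_selector=".row", timeout=15):
    """Return the page source once an element matching css_selector is present.

    Hypothetical helper. Selenium is imported lazily, so the module can be
    imported (and unique_codes used) on a machine without a browser driver.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Raises TimeoutException if nothing matches within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def unique_codes(texts):
    """Strip whitespace, skip blanks, and drop duplicates, keeping first-seen order."""
    seen = []
    for text in texts:
        code = text.strip()
        if code and code not in seen:
            seen.append(code)
    return seen


# Possible usage (needs Chrome, a matching driver, and bs4 installed):
#   from bs4 import BeautifulSoup
#   html = fetch_rendered_html('http://www.asx.com.au/asx/statistics/todayAnns.do')
#   soup = BeautifulSoup(html, "html.parser")
#   print(unique_codes(item.text for item in soup.select('.row')))
```

The wait condition only checks that at least one matching element exists, so if the table fills in gradually, a short extra pause or a stricter condition may still be needed.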