
bs4 / Beautiful Soup does not find divs nested inside divs for some reason


the HTML:

<div id="divTradeHaltResults">
  <div class="genTable">
    <table>
      <tbody>
        <tr>
          <td> 03/10/2020 </td>
          <td> 15:11:45 </td>

the Code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.nasdaqtrader.com/trader.aspx?id=TradeHalts'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
table = soup.find('div', {'id': 'divTradeHaltResults'})
divclass = table.find('div', {'class': 'genTable'})
divt = divclass.find('table')

result:

divclass = None  (NoneType)

I have tried the 'lxml' parser to no avail.
I can get the data with Selenium, but it uses too many resources.
From reading about other problems involving multiple nested divs,
there seems to be an inherent problem with bs4.
Has anyone solved it? I have tried multiple ideas from other
people.


Solution

  • The reason you are getting None is that the table is loaded dynamically via JavaScript, which runs only after the page itself loads, so the markup is never in the raw HTML that requests downloads.

    Therefore I tracked down the origin of the table: the page's JS sends an XHR request to fetch it. You can trace this yourself in your browser's Developer Tools, under the Network tab.

    Otherwise you can use selenium for that. I've included both solutions for you.

    import requests
    import pandas as pd
    
    
    # JSON-RPC payload that the page's JS sends via XHR to fetch the table
    payload = {
        "id": 2,
        "method": "BL_TradeHalt.GetTradeHalts",
        "params": "[]",
        "version": "1.1"
    }
    
    # Referer header as sent by the real page
    headers = {
        'Referer': 'https://www.nasdaqtrader.com/trader.aspx?id=TradeHalts'
    }
    
    r = requests.post(
        "https://www.nasdaqtrader.com/RPCHandler.axd", json=payload, headers=headers).json()
    
    # "result" holds the rendered table as an HTML fragment; parse it with pandas
    df = pd.read_html(r["result"])[0]
    
    df.to_csv("table1.csv", index=False)
    

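For reference, `pd.read_html` works on any HTML input that contains a `<table>`; the `r["result"]` field above is exactly such a fragment. A minimal self-contained sketch with made-up halt data:

```python
from io import StringIO

import pandas as pd

# pd.read_html parses every <table> in the input and returns a list of
# DataFrames, one per table; index [0] picks the first (and only) one here.
html = """
<table>
  <tr><th>Halt Date</th><th>Halt Time</th></tr>
  <tr><td>03/10/2020</td><td>15:11:45</td></tr>
</table>
"""

df = pd.read_html(StringIO(html))[0]
print(df.shape)  # (1, 2)
```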

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    import pandas as pd
    
    
    # Run Firefox headless so no browser window opens
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Firefox(options=options)
    
    driver.get(
        "https://www.nasdaqtrader.com/trader.aspx?id=TradeHalts")
    
    # The rendered page contains several tables; the halts table is the third
    df = pd.read_html(driver.page_source)[2]
    
    # print(df)
    df.to_csv("table.csv", index=False)
    driver.quit()
    
    

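As a side note, BeautifulSoup itself has no problem with nested divs: running the question's exact `find()` chain against a static copy of the markup succeeds, which confirms the original `None` came from the table never being in the downloaded HTML, not from a bs4 bug. A quick sketch:

```python
from bs4 import BeautifulSoup

# Static copy of the markup from the question -- when the HTML is actually
# present in the response, the same find() chain works fine.
html = """
<div id="divTradeHaltResults">
  <div class="genTable">
    <table>
      <tbody>
        <tr><td>03/10/2020</td><td>15:11:45</td></tr>
      </tbody>
    </table>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find('div', {'id': 'divTradeHaltResults'})
divclass = table.find('div', {'class': 'genTable'})
divt = divclass.find('table')
print(divt is not None)  # True -- bs4 handles nested divs fine
```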