Tags: python, web-scraping, selenium-chromedriver

Scrape a table from a webpage using Python


I want to get the table contents of this website. I know that with only three pages I could copy the data manually, but I would still like a script that automates the whole process.

However, the table is split across several pages by a custom paginator, and my code below only retrieves the table on the first page:

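import time
from bs4 import BeautifulSoup as bs
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)  # url is the address of the page that shows the table
time.sleep(5)  # give the JavaScript-rendered table time to load
html_str = driver.page_source
soup = bs(html_str, "html.parser")
soup.find("table")  # returns only the table for page 1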

Here is the paginator part of the soup. I have no experience in web development and do not understand what actually happens after we click Next.

<ha-paginator data-translation-block="false" data-translation-id="1442"><!-- --><nav aria-label="Page navigation" class="text-center" data-translation-block="false" data-translation-id="1443">
<ul class="pagination" data-translation-block="false" data-translation-id="1444">
<!-- -->
<!-- --><li class="active" data-translation-block="false" data-translation-id="1445">
<!-- --><a data-translated="false" data-translation-checksum="57ad7d2ec0e248914c2b0ae7efc17011d1435f99d807e43b172697027ffe46ce500c3ff64f5162eaa059c11a23fa5d8c442ab67bd219d74311601bed517cf477" href="#"> 1
        <!-- --><span class="sr-only">(current)</span>
</a>
</li><li data-translation-block="false" data-translation-id="1446">
<!-- --><a data-translated="false" data-translation-checksum="7eece0387dc3c6876397df60e2d7dbe0e2c94ecdc42d7e50d5208a4c84885caa703c487d86900ac97f10ad493893db85144cf7889d8ac8fd008dfd4c8f0e98df" href="#"> 2
        <!-- -->
</a>
</li><li data-translation-block="false" data-translation-id="1447">
<!-- --><a data-translated="false" data-translation-checksum="aa08ec665075172d835562b332e78832e7f9d3b7f3df47d5a32b8f3a1682daaed49831faf19eeaca164d8e94e3449ade2a83d83dfaa83878c832f644fea11f95" href="#"> 3
        <!-- -->
</a>
</li><!-- --><li data-translated="false" data-translation-block="true" data-translation-checksum="7d03f54e74b11d46eacd33365a0aa16a3ba2857949c7f795c2d9c07b5689fbc4230dc22c45af2303eba21a7d8016f197d9b474d4149db6d0df059ce00416e192" data-translation-id="1448">
<a href="#">
          Next
        </a>
</li>
</ul>
</nav>
<!-- --></ha-paginator>
<hr class="big" data-translation-block="true" data-translation-id="1449"/>
</div>
</div>
</div>
</ha-table-search>
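
One thing the markup above does show is that every page link, including Next, has href="#". Clicking it therefore does not navigate to a new URL, and the table is presumably swapped in place by JavaScript, which would explain why a single page_source snapshot only ever contains page 1. A minimal sketch that lists the link targets with the standard library only (paginator_html is an abbreviated copy of the snippet above with the data-translation attributes and comments dropped, and LinkCollector is a hypothetical helper name):

```python
from html.parser import HTMLParser

# Abbreviated copy of the paginator markup shown above.
paginator_html = """
<ul class="pagination">
<li class="active"><a href="#"> 1 <span class="sr-only">(current)</span></a></li>
<li><a href="#"> 2 </a></li>
<li><a href="#"> 3 </a></li>
<li><a href="#"> Next </a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect a (label, href) pair for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text of nested tags (the sr-only span) is collected as well.
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            label = " ".join("".join(self._text).split())
            self.links.append((label, self._href))
            self._in_link = False


collector = LinkCollector()
collector.feed(paginator_html)
for label, href in collector.links:
    print(f"{label!r} -> {href!r}")
```

If that reading is right, a browser-driven script would have to click Next and read driver.page_source again after each click, because the URL never changes between pages.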

Solution

  • The data you see on the page is loaded from an external URL via JavaScript, so you can request it directly from that URL:

    import pandas as pd
    import requests
    
    url = "https://immi.homeaffairs.gov.au/_layouts/15/api/data.aspx/GetPriceList"
    
    data = requests.post(url, json={"category": "Visa", "onshore": "All"}).json()
    df = pd.DataFrame(data["d"]["data"])
    
    df.pop("note")
    print(df.head(5))
    

    Prints:

      visaSubclassCode                                           visaSubclassText streamCode streamText onShore    basePrice  over18Price under18Price nonInternetPrice subsequentPrice
    0              100  Partner (Provisional and Migrant) visa (subclass 309/100)                            No  AUD8,850.00  AUD4,430.00  AUD2,215.00              N/A             N/A
    1              101                                  Child visa (subclass 101)                            No  AUD3,055.00  AUD1,530.00    AUD765.00              N/A             N/A
    2              102                               Adoption visa (subclass 102)                            No  AUD3,055.00  AUD1,530.00    AUD765.00              N/A             N/A
    3              117                        Orphan Relative visa (subclass 117)                            No  AUD1,870.00    AUD935.00    AUD470.00              N/A             N/A
    4              124                   Distinguished Talent visa (subclass 124)                            No  AUD4,110.00  AUD2,055.00  AUD1,030.00              N/A             N/A
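
The price columns in the output above are display strings such as AUD8,850.00, with N/A where no fee is listed, so they cannot be sorted or summed numerically as they are. A minimal sketch of one way to convert them, using only the standard library. The helper name parse_price is hypothetical, and the code assumes every price is a string in the AUD<amount> pattern seen in the printed rows:

```python
from decimal import Decimal


def parse_price(text):
    """Convert 'AUD8,850.00' to Decimal('8850.00'); return None for 'N/A' or blanks."""
    text = text.strip()
    if not text or text.upper() == "N/A":
        return None
    if not text.startswith("AUD"):
        # Assumption: only AUD amounts occur; fail loudly on anything else.
        raise ValueError(f"unexpected price format: {text!r}")
    return Decimal(text[len("AUD"):].replace(",", ""))


# Values copied from the first row of the printed output above.
row = {
    "basePrice": "AUD8,850.00",
    "over18Price": "AUD4,430.00",
    "under18Price": "AUD2,215.00",
    "nonInternetPrice": "N/A",
}
parsed = {column: parse_price(value) for column, value in row.items()}
print(parsed)
```

With the DataFrame from the answer, the same function could be applied per column, for example df["basePrice"].map(parse_price), provided the column really contains only strings.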