Tags: python, python-3.x, beautifulsoup, automation, data-extraction

Extracting table data from web using python


I need to extract a table from the website "https://geniusimpex.org/pakistan-import-data/", which has thousands of rows, so I wanted to automate the process using bs4 and Selenium. But when I extract the table, only the table header is extracted. This is the code I used:

from bs4 import BeautifulSoup   
from urllib.request import urlopen

url = "https://geniusimpex.org/pakistan-import-data/"

html = urlopen(url)

soup = BeautifulSoup(html, 'lxml')  
type(soup)  
soup.prettify()  
print(soup.find_all('tr'))  

It shows the following output (screenshot): https://i.sstatic.net/GItzv.png

As you can see, only the first row is extracted. Can someone please tell me why I am not able to extract the table, and how I can do so? That would be very helpful. Sorry if I wasn't clear or couldn't explain my problem well; this is the first time I am asking a question on Stack Overflow.
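For context: this page builds the table rows in the browser with JavaScript, so the HTML that `urlopen` receives contains only the header row. A minimal sketch of the situation (the markup below is made-up placeholder HTML to illustrate the point, not the site's actual source):

```python
from bs4 import BeautifulSoup

# Placeholder markup illustrating what the server initially sends:
# the <tbody> is empty and only gets filled in later by JavaScript,
# which urllib/BeautifulSoup never execute.
static_html = """
<table>
  <thead><tr><th>Importer Name</th><th>Goods Description</th></tr></thead>
  <tbody></tbody>
</table>
"""

soup = BeautifulSoup(static_html, "html.parser")
print(len(soup.find_all("tr")))  # only the header <tr> is present
```

That is why `soup.find_all('tr')` returns just the header: the rows simply are not in the downloaded HTML.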


Solution

  • The data is loaded from an external URL as JSON. You can use this script to load the information:

    import requests
    
    
    url = 'https://geniusimpex.org/wp-admin/admin-ajax.php?action=ge_forecast_list_data&order=asc&offset={offset}&limit=1000'
    
    offset = 0
    while True:
        data = requests.get(url.format(offset=offset)).json()
    
        # print data to screen:
        for row in data.get('rows', []):
            for k, v in row.items():
                print('{:<30} {}'.format(k, v))
            print('-' * 80)
    
        if len(data.get('rows', [])) != 1000:
            break
    
        offset += 1000
    

    Prints:

    ...
    
    --------------------------------------------------------------------------------
    count                          T
    importer_name                  <span file_id="27893" post_count="T" post_id="2157293">BISMILLAH STEEL FURNACE \n NEAR GRID STATION DEEWAN</span>
    goods_description              IRON AND STEEL REMELTABLE SCRAP HARMONIZED CODE: 7204.4990 REFERENCE NUMBER:UM/PAK/5146A ITN: X20200629019843 NWT WEIGHT-19.650 MT SHIPPERS LOAD, STOWAGE AND COUNT
    hs_code                        
    shipment_port                   NEWARK  APT/NEW 
    gross_weight                    19.65 
    number_of_packages              1 
    unit_of_packages                PACKAGES 
    size_of_container               1 X 20FT 
    imported_from_name             SEALINK INTERNATIONAL INC C/O\n UNIVERSAL METALS, ,
    bill_of_lading_number           SII145321 
    bill_of_lading_date            <span data="10-08-2020">10-08-2020</span>
    --------------------------------------------------------------------------------
    count                          T
    importer_name                  <span file_id="27938" post_count="T" post_id="2159597">ASAD SHAHZAD S/O FAQIR ZADA</span>
    goods_description              1 USED VEHICLE TOYOTA VITZ CHASSIS NO: KSP130 -2204837
    hs_code                        NA
    shipment_port                   NAGOYA,  AICHI 
    gross_weight                    .97 
    number_of_packages              1 
    unit_of_packages                UNIT 
    size_of_container               1 X 40FT 
    imported_from_name             KASHMIR MOTORS , 3055-9-104 KUZUTSUKA NIIGATA KITA
    bill_of_lading_number           TA200716H06- 10 
    bill_of_lading_date            <span data="10-08-2020">10-08-2020</span>
    --------------------------------------------------------------------------------
    
    
    ...
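    Note that some fields in the JSON (e.g. `importer_name` and `bill_of_lading_date` above) come back wrapped in HTML `<span>` tags. If you only want the text, you can strip the markup with BeautifulSoup (a small sketch; the sample value is taken from the output above):

```python
from bs4 import BeautifulSoup

raw = '<span data="10-08-2020">10-08-2020</span>'
# get_text() drops the tags and keeps only the text content
text = BeautifulSoup(raw, "html.parser").get_text(strip=True)
print(text)  # 10-08-2020
```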
    

    EDIT: To save to CSV, you can use this script:

    import requests
    import pandas as pd
    
    
    url = 'https://geniusimpex.org/wp-admin/admin-ajax.php?action=ge_forecast_list_data&order=asc&offset={offset}&limit=1000'
    
    offset = 0
    all_data = []
    while True:
        data = requests.get(url.format(offset=offset)).json()
    
        # collect rows and print data to screen:
        for row in data.get('rows', []):
            all_data.append(row)
            for k, v in row.items():
                print('{:<30} {}'.format(k, v))
            print('-' * 80)
    
        if len(data.get('rows', [])) != 1000:
            break
    
        offset += 1000
    
    df = pd.DataFrame(all_data)
    df.to_csv('data.csv', index=False)
    

    Produces a data.csv file containing all the rows.
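    As a final cleanup step, you could strip the leftover HTML tags and surrounding whitespace from every field before writing the CSV. A sketch, assuming rows shaped like the sample output above (the sample values here are abbreviated from it):

```python
from bs4 import BeautifulSoup

def clean_value(v):
    """Strip HTML tags and surrounding whitespace from a field value."""
    if isinstance(v, str):
        return BeautifulSoup(v, "html.parser").get_text(strip=True)
    return v

# Abbreviated sample row, shaped like the JSON rows shown above
row = {
    'importer_name': '<span file_id="27938">ASAD SHAHZAD S/O FAQIR ZADA</span>',
    'shipment_port': ' NAGOYA,  AICHI ',
    'bill_of_lading_date': '<span data="10-08-2020">10-08-2020</span>',
}
cleaned = {k: clean_value(v) for k, v in row.items()}
print(cleaned['importer_name'])  # ASAD SHAHZAD S/O FAQIR ZADA
```

    To apply this to the whole table before saving, you could run something like `df = df.applymap(clean_value)` just before `df.to_csv(...)`.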