Scraping tables on a web page with BeautifulSoup


I need to build a DataFrame in Python with the data on the 500 largest companies in Latin America:

https://www.americaeconomia.com/negocios-industrias/estas-son-las-500-mayores-empresas-de-america-latina-2021

I tried web scraping, but print(tabla) outputs [] or None:

from bs4 import BeautifulSoup
import requests

url = 'https://www.americaeconomia.com/negocios-industrias/estas-son-las-500-mayores-empresas-de-america-latina-2021'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')

tabla = soup.find('table', {"id":"awesomeTable"})
print(tabla)

Solution

  • What happens?

    Always inspect your soup first; it is the source of truth. The HTML that requests receives can differ slightly or drastically from what the browser's developer tools show.

    You won't find the table in your soup because it sits inside an iframe, which is a separate document that the browser loads from its own URL.

    How to fix?

    Send your request to the URL in the iframe's src attribute instead:

    https://rk.americaeconomia.com/display/embed/500-latam/2021
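
To confirm this from the soup rather than from the browser, here is a minimal sketch. The inline `html` string is a hypothetical stand-in for the article page (the real markup will be larger and may differ); it only assumes that the outer page carries an iframe and no `#awesomeTable` of its own.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page: no table here, only an iframe
# whose src points at the document that actually holds the table.
html = """
<html><body>
  <h1>Ranking</h1>
  <iframe src="https://rk.americaeconomia.com/display/embed/500-latam/2021"></iframe>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Same lookup as in the question: it finds nothing in the outer page.
table = soup.find("table", {"id": "awesomeTable"})

# Collect the src of every iframe; these are the URLs worth requesting next.
iframe_sources = [frame.get("src") for frame in soup.find_all("iframe")]

print(table)
print(iframe_sources)
```

Against the live page you would build the soup from `page.text` as in the question; if the list contains several entries, open each one to see which holds the table.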
    

    Example

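    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    # Send a browser-like user agent, since some servers reject the requests default
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
    # Request the iframe source directly, not the article page
    r = requests.get('https://rk.americaeconomia.com/display/embed/500-latam/2021', headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')

    # Build one list of cell strings per data row
    data = []
    for row in soup.select('#awesomeTable tbody tr.dataRow'):
        data.append(list(row.stripped_strings))

    # The first row of the table supplies the column names
    df = pd.DataFrame(data, columns=list(soup.select_one('#awesomeTable tr').stripped_strings))
    print(df)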
    

    Output

    RK 2021  EMPRESA        PAÍS
    1        PETROBRAS      BRA
    2        JBS            BRA
    3        AMÉRICA MÓVIL  MX
    4        PEMEX          MX
    5        VALE           BRA
    ...      ...            ...
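
One caveat with `stripped_strings`: an empty cell yields no string at all, so a row with a blank value comes out shorter than the header and the DataFrame constructor raises an error. Whether the real table has empty cells is an assumption here, not something checked against the site. The sketch below uses a small hypothetical table (the second row is invented to show the effect) and reads each row cell by cell instead, which keeps empty cells as empty strings.

```python
from bs4 import BeautifulSoup

# Hypothetical table with the same id and row class as in the answer;
# the blank EMPRESA cell in row 2 is made up for illustration.
html = """
<table id="awesomeTable">
  <thead><tr><th>RK 2021</th><th>EMPRESA</th><th>PAÍS</th></tr></thead>
  <tbody>
    <tr class="dataRow"><td>1</td><td>PETROBRAS</td><td>BRA</td></tr>
    <tr class="dataRow"><td>2</td><td></td><td>BRA</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
data_rows = soup.select("#awesomeTable tbody tr.dataRow")

# Header names come from the cells of the first row of the table.
first_row = soup.select_one("#awesomeTable tr")
header = [cell.get_text(strip=True) for cell in first_row.find_all(["th", "td"])]

# stripped_strings skips the empty cell, so the second row loses a column.
by_strings = [list(tr.stripped_strings) for tr in data_rows]

# Reading each td keeps one value per column, empty or not.
by_cells = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in data_rows]

# Pair each value with its column name.
records = [dict(zip(header, row)) for row in by_cells]
print(records)
```

`pd.DataFrame(by_cells, columns=header)` would then build the frame with aligned columns.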