Tags: python, beautifulsoup, python-requests, finance

HTTP Error 404 when scraping first table using BeautifulSoup, but second table works fine


I’m working on a Python script to scrape historical CDS data from Investing.com using BeautifulSoup. The goal is to extract data from a specific table on the page and compile it into a DataFrame.

Here’s the core part of my code:

from io import StringIO
from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup

lista_cds = ['cds-1-year', 'cds-2-year', 'cds-3-year',
             'cds-4-year', 'cds-5-year', 'cds-7-year', 'cds-10-year']

headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

lista_dfs = []

for ano_cds in lista_cds:
    url = f'https://br.investing.com/rates-bonds/brazil-{ano_cds}-usd-historical-data'

    # Request the page with a browser-like user agent and parse it
    req = Request(url, headers=headers)
    page = urlopen(req)
    soup = BeautifulSoup(page, features='lxml')

    # Take the first <table> on the page and keep only the two columns needed
    table = soup.find_all("table")[0]

    df_cds = pd.read_html(StringIO(str(table)))[0][['Último', 'Data']]

Problem: When I attempt to scrape data from the first table (tables[0]), I receive an HTTP Error 404: Not Found. However, when I switch to the second table (tables[1]), the code works perfectly fine, but that’s not the table I need.

Interestingly, someone else ran the exact same code, targeting tables[0], and it worked perfectly for them. This leads me to believe the issue might not be with the code itself but potentially with something environment-specific or a peculiar response from the server.

But I am not sure whether that person is lying or whether something else is going on.

My environment:

  • VS Code
  • Python Version: 3.11.5
  • Windows 11

Solution

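  • The values in lista_cds are wrong: every element except cds-1-year should end in years, not year. The wrong slug produces a URL that does not exist, and the HTTP Error 404 is raised by urlopen when that page is requested, not by the table index.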

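    You can also use pandas.read_html directly, without urllib/BeautifulSoup. The storage_options argument used below to send the headers needs a recent pandas (it was added to read_html in version 2.1.0).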

    Try this code:

    import pandas as pd 
    
    lista_cds = ['cds-1-year', 'cds-2-years', 'cds-3-years', 'cds-4-years', 'cds-5-years', 'cds-7-years', 'cds-10-years']
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36'}
    
    url = 'https://br.investing.com/rates-bonds/brazil-{}-usd-historical-data'
    lista_dfs = [pd.read_html(url.format(ano_cds), storage_options=headers)[0][['Último', 'Data']] for ano_cds in lista_cds]
    
    print(lista_dfs)
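
To confirm that the 404 comes from the URL rather than from the table index, it can help to check each page's HTTP status before parsing anything. The following is a minimal sketch, not part of the original answer: cds_slug, http_status, BASE_URL and the shortened HEADERS value are hypothetical names introduced here for illustration, and the pluralization rule simply mirrors the corrected list above.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Same URL pattern as in the answer above
BASE_URL = 'https://br.investing.com/rates-bonds/brazil-{}-usd-historical-data'

# Placeholder header (assumption); use a full browser user-agent string in practice
HEADERS = {'user-agent': 'Mozilla/5.0'}


def cds_slug(years):
    """Return 'cds-1-year' for 1 and 'cds-N-years' for any other maturity."""
    suffix = 'year' if years == 1 else 'years'
    return f'cds-{years}-{suffix}'


def http_status(url, opener=urlopen):
    """Return the HTTP status code for url.

    urlopen raises HTTPError for responses such as 404, so the error
    is caught and its status code is returned instead.
    """
    try:
        with opener(Request(url, headers=HEADERS)) as response:
            return response.status
    except HTTPError as err:
        return err.code
```

Calling http_status(BASE_URL.format(cds_slug(n))) for each n in (1, 2, 3, 4, 5, 7, 10) should then show which URLs the server actually serves. The opener parameter exists only so the function can be exercised without network access.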