BeautifulSoup: iteration over 24 letter pages (from A to Z) fails: reducing the complexity to get a first insight into the dataset


I have a list of insurers in Spain. It is published on a website, organised into 24 sections:

Insurers in Spain, the full list: https://www.unespa.es/en/directory

It is divided into 24 pages, from https://www.unespa.es/en/directory/#A to https://www.unespa.es/en/directory/#Z

What I am aiming for: I want to fetch the data from these pages with BS4 and requests, and finally save it into a DataFrame. Scraping the list with BeautifulSoup (BS4) and requests in Python seems appropriate; I think the following steps are needed:

a. First, import the necessary libraries: BeautifulSoup, requests, and pandas.
b. Then use the requests library to get the HTML content of each page of interest, i.e. the A to Z pages.
c. Then use BeautifulSoup to parse the HTML content.
d. Next, extract the relevant information (the insurers' names) from the parsed HTML.
e. Finally, store the extracted data in a pandas DataFrame.

But this does not work, and neither does the iteration from A to Z:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/"

# List to store all insurers
all_insurers = []

# Loop through each page (A to Z)
for char in range(65, 91):  # ASCII codes for A to Z
    page_url = f"{base_url}#{chr(char)}"
    insurers = scrape_insurers(page_url)
    all_insurers.extend(insurers)

# Convert the list of insurers to a pandas DataFrame
df = pd.DataFrame({'Insurer': all_insurers})

# Display the DataFrame
print(df.head())

# Save DataFrame to a CSV file
df.to_csv('insurers_spain.csv', index=False)

It fails with the following results:

Failed to retrieve data from https://www.unespa.es/en/directory/#A
Failed to retrieve data from https://www.unespa.es/en/directory/#B
Failed to retrieve data from https://www.unespa.es/en/directory/#C
Failed to retrieve data from https://www.unespa.es/en/directory/#D
Failed to retrieve data from https://www.unespa.es/en/directory/#E

and so forth.

I think it is easier to reduce the complexity first.

I think it is better to take one single URL that I want to visit and test what we get back from the request. Once that is done, I can evaluate the response and use the BeautifulSoup library to check for specific fields. I should avoid doing three things (each of which can obviously go terribly wrong) in one step.

So I did it like this for the first character, A:

import requests
from bs4 import BeautifulSoup

# Function to scrape insurers from a given URL
def scrape_insurers(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Extracting insurer names
        insurers = [insurer.text.strip() for insurer in soup.find_all('h3')]
        return insurers
    else:
        print("Failed to retrieve data from", url)
        return []

# Define the base URL
base_url = "https://www.unespa.es/en/directory/#"

# Define the character we want to fetch data for
char = 'A'

# Construct the URL for the specified character
url = base_url + char

# Fetch and print data for the specified character
insurers_char = scrape_insurers(url)
print(f"Insurers for character '{char}':")
print(insurers_char)

But see the output here:

Failed to retrieve data from https://www.unespa.es/en/directory/#A
Insurers for character 'A':
[]
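
One detail worth checking before debugging further: the part of a URL after `#` is a fragment, which the client keeps to itself and does not send to the server. The sketch below uses only the standard library to show that the A to Z URLs from the loop all reduce to the same request target. The helper name `request_target` is made up for illustration.

```python
from urllib.parse import urldefrag


def request_target(url):
    """Return the URL without its fragment (hypothetical helper name)."""
    # urldefrag splits "page#frag" into (url="page", fragment="frag")
    return urldefrag(url).url


base_url = "https://www.unespa.es/en/directory/"

# Build the same 26 URLs as the loop above and collect the distinct targets
targets = {request_target(f"{base_url}#{chr(code)}") for code in range(65, 91)}
print(targets)
```

So the 26 requests in the loop all ask for the same document, which is why the failure message is identical for every letter. The status code itself is not printed in the code above; printing `response.status_code` would show why the server refused the request.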

Solution

  • Try a single request to the directory page with a browser-like User-Agent header:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.unespa.es/en/directory/"
    
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"
    }
    
    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
    
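    data = []
    # Each insurer entry sits in an element with the class "contact-item"
    for c in soup.select(".contact-item"):
        # Unwrap inline <span>/<a> tags so their text joins the surrounding text
        for t in c.select("span, a"):
            t.unwrap()
        # Merge the now-adjacent text nodes into single strings
        c.smooth()
    
        # The first chunk is the insurer name; the rest are "Label: value" pairs
        title, *other = c.get_text(separator="|||", strip=True).split("|||")
        data.append(
            {"Title": title, **{(s := d.split(":", maxsplit=1))[0]: s[1] for d in other}}
        )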
    
    df = pd.DataFrame(data)
    print(df)
    

    Prints:

                                                                                          Title                         Tfno.                           Fax                                                         Web                                                                                           Dirección                                          Email
    0                               A.M.A., AGRUPACIÓN MUTUAL ASEGURADORA, MUTUA DE SEGUROS APF                  91 343 47 00                (91) 343 47 68                                   http://www.amaseguros.com                                                              VÍA DE LOS POBLADOS, 3 28033  (MADRID)                                            NaN
    1                                                  ABANCA GENERALES DE SEGUROS Y REASEGUROS         881920742 / 881920744                           NaN                                                         NaN                                                  AV. LINARES RIVAS 30, 3º 15005 A CORUÑA (A CORUÑA)                                            NaN
    2                                     ABANCA VIDA Y PENSIONES DE SEGUROS Y REASEGUROS, S.A.                   981 188 075                           NaN                                                         NaN                                         AVENIDA DE LA MARINA, 1-3ª PLANTA 15001 A CORUÑA (A CORUÑA)                                            NaN
    3                                          ADMIRAL EUROPE COMPAÑIA DE SEGUROS S.A.U. (AECS)                           NaN                           NaN                              https://www.admiraleurope.com/                                               RODRÍGUEZ MARÍN, 61 - 1ª PLANTA 28016 MADRID (MADRID)                                            NaN
    4                                    AEGON ESPAÑA, SOCIEDAD ANÓNIMA DE SEGUROS Y REASEGUROS                  91 563 62 22                           NaN                                         http://www.aegon.es                 VÍA DE LOS POBLADOS, 3 - EDIFICIO 4B - PARQUE EMPRESARIAL CRISTALIA 28033  (MADRID)                                            NaN
    5                                          AGROPELAYO SOCIEDAD DE SEGUROS, SOCIEDAD ANÓNIMA                           NaN                           NaN                                                         NaN                                                             SANTA ENGRACIA, 67 - 69 28010  (MADRID)                                            NaN
    
    
    ...
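
To make the dense dictionary line easier to follow, here is a minimal standalone sketch of the same splitting step on a hand-written string. The sample text is an assumption about what `get_text(separator="|||")` returns for one `.contact-item`: it is modelled on the first row of the output above, not captured from the site. `parse_item` is a hypothetical helper name. Unlike the one-liner, it strips whitespace around keys and values and skips pieces that contain no colon.

```python
def parse_item(text, sep="|||"):
    """Split 'Title|||Label: value|||...' into a flat dict (hypothetical helper)."""
    title, *other = text.split(sep)
    row = {"Title": title.strip()}
    for piece in other:
        # Split on the first colon only, so URLs such as "http://..." stay intact
        key, found, value = piece.partition(":")
        if not found:
            continue  # not a "Label: value" piece, skip it instead of raising
        row[key.strip()] = value.strip()
    return row


# Assumed sample, modelled on the first row of the printed DataFrame
sample = (
    "A.M.A., AGRUPACIÓN MUTUAL ASEGURADORA, MUTUA DE SEGUROS APF"
    "|||Tfno.: 91 343 47 00"
    "|||Fax: (91) 343 47 68"
    "|||Web: http://www.amaseguros.com"
)

row = parse_item(sample)
print(row)
```

A list of such dicts can be passed to `pd.DataFrame` in the same way as `data` in the answer; labels missing from an entry show up as NaN in that row.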