Search code examples
pythonpandasbeautifulsouppandas.excelwriter

How can I webscrape information in the html element and save it to an Excel row using Beautifulsoup and any excel writer(Pandas)?


I'm new to python and I'm doing it for my project. Can someone help me save it to an excel file?.

This is needed for multiple site URLs so need to add each information into new rows in excel. Sample HTML code is attached below. Please help me on saving it to excel rows and columns and how to iterate it using for loop.


Solution

  • I would suggest you just use openpyxl directly rather than via Pandas, this would give you much greater control over how your Excel file would be formatted.

    Here is how you could build up multiple row in an Excel file:

    import requests
    from bs4 import BeautifulSoup
    from xlwt import Workbook
    import openpyxl
    from openpyxl.styles.borders import Border, Side
    from openpyxl.utils import get_column_letter
    from openpyxl.styles import Alignment
    
    website_url = "https://www.example.com/"
    res = requests.get(website_url, verify=False)
    soup = BeautifulSoup(res.text, 'lxml')
    Links = soup.find_all("a", {"class": "jobTitleLink"},)
    url = [tag.get('href') for tag in Links]
    wb = openpyxl.Workbook()
    
    # Write a header row
    columns = [
        ("SL No", 10),
        ("Job Title", 25),
        ("Company Name", 20),
        ("Posted on", 13),
        ("Closing on", 13),
        ("Location", 20),
        ("Description", 40),
        ("Skills", 70),
        ("Link Email", 30),
    ]
    
    thin_border = Border(left=Side(style='thin'), right=Side(
        style='thin'), top=Side(style='thin'), bottom=Side(style='thin'))
    ws = wb.active
    
    for col_number, (value, width), in enumerate(columns, start=1):
        ws.cell(column=col_number, row=1, value=value).border = thin_border
        ws.column_dimensions[get_column_letter(col_number)].width = width
    
    row_number = 2
    
    # get the first link in the entire page
    # get value of the href attribute
    
    for x in url[1:5]:
        res = requests.get(f'https://www.example/com/{x}', verify=False)
        soup = BeautifulSoup(res.text, 'lxml')
        data = []
    
        for div_block in soup.find_all('div', class_='block', style=None):
            data.append([line.strip() for line in div_block.stripped_strings])
    
        li_fr = soup.find('li', class_="fr")
        company_name = li_fr.a.text
        location = list(li_fr.find_next_sibling('li').stripped_strings)[1]
    
        # Write a data row
        row = [
            '',  # SL No
            data[0][0],  # Job title
            company_name,  # Company name
            data[1][1],
            data[2][1],
            location,
            data[4][1],
            '\n'.join(data[5][1:]),
            data[3][1],
        ]
    
        for col_number, value in enumerate(row, start=1):
            cell = ws.cell(column=col_number, row=row_number, value=value)
            cell.border = thin_border
            cell.alignment = Alignment(wrapText=True)
    
        row_number += 1
    
    wb.save('output.xlsx')
    print('Saved all the data')
    

    This would give you an Excel sheet looking like:

    Excel screenshot

    With extra work you can apply any styling you prefer.