Tags: python, web-scraping, xpath, beautifulsoup

How to scrape a table using BeautifulSoup when it has only summary and width?


I am trying to scrape a table from this site: http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm

This table has no id or class attribute and only has summary and width. Is there any way to scrape this table? Perhaps with XPath?

I heard that XPath is not compatible with BeautifulSoup and I hope that is wrong.

<table width="100%" cellpadding="3" border="1" summary="Layout showing RecallTest table with 6 columns: Date,Brand Name,Product Description,Reason/Problem,Company,Details/Photo" style="margin-bottom:28px">
          <thead>
            <tr>
                    <th scope="col" data-type="numeric" data-toggle="true"> Date </th>
            </tr>
          </thead>
          <tbody>

Here is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

link = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
page = 15
pdf = []
for p in range(1, page+1):
    l = link + '?page=' + str(p)
    # Downloading contents of the web page
    data = requests.get(l).text
    # Creating BeautifulSoup object
    soup = BeautifulSoup(data, 'html.parser')
    tables = soup.find_all('table')
    table = soup.find('table', INSERT XPATH EXPRESSION)
    df = pd.DataFrame(columns=['date', 'brand', 'descr', 'reason', 'company'])
    for row in table.tbody.find_all('tr'):
        # Find all data for each column
        columns = row.find_all('td')
        if columns != []:
            date = columns[0].text.strip()
                    

Solution

  • When scraping tables, it is best practice to use pandas.read_html(), which covers 95% of all cases. Simply iterate over the pages and concat the dataframes:

    import pandas as pd
    
    url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
    
    pd.concat(
        [pd.read_html(url+'?page='+str(i))[0] for i in range(1,16)],
        ignore_index=True
    )
    

    Note that you can also include the links via extract_links='body'

    This will result in:

    Date Brand Name Product Description Reason/Problem Company Details/Photo
    0 12/31/2015 PharMEDium Norepinephrine Bitartrate added to Sodium Chloride Discoloration PharMEDium Services, LLC nan
    1 12/31/2015 Thomas Produce Cucumbers Salmonella Thomas Produce Company nan
    2 12/28/2015 Wegmans, Uoriki Fresh Octopus Salad Listeria monocytogenes Uoriki Fresh, Inc. nan
    ...
    433 01/05/2015 Whole Foods Market Assorted cookie platters Undeclared tree nuts Whole Foods Market nan
    434 01/05/2015 Eillien's, Blain's Farms and Fleet & more Walnut Pieces Salmonella contamination Eillien’s Candies Inc. nan
    435 01/02/2015 Full Tilt Ice Cream Ice Cream Listeria monocytogenes Full Tilt Ice Cream nan
    436 01/02/2015 Zilks Hummus Undeclared peanuts Zilks Foods nan
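    The `extract_links` option mentioned above turns each body cell into a `(text, href)` tuple (pandas >= 1.5). A minimal, self-contained sketch — the HTML snippet here is a made-up stand-in for the FDA table, not the real page:

    ```python
    from io import StringIO
    import pandas as pd

    # Made-up stand-in for one row of the recalls table
    html = '''
    <table>
      <tr><th>Date</th><th>Details/Photo</th></tr>
      <tr><td>12/31/2015</td><td><a href="/recall/1">Photo</a></td></tr>
    </table>
    '''

    df = pd.read_html(StringIO(html), extract_links='body')[0]
    # Each body cell becomes a (text, href) tuple
    print(df.iloc[0, 1])  # ('Photo', '/recall/1')
    ```

    Cells without an `<a>` tag come back as `(text, None)`, so the tuples can be split out afterwards, e.g. with `df['Details/Photo'].str[1]`.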

    Based on your manual approach, simply select the first table, iterate over the rows, and store the information in a list of dicts, which can easily be converted into a dataframe:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
    
    data = []
    
    for i in range(1,16):
        soup = BeautifulSoup(requests.get(url+'?page='+str(i)).text, 'html.parser')
        for e in soup.table.select('tr:has(td)'):
            data.append({
                'date': e.td.text,
                'any other': 'column',
                'link': e.a.get('href') if e.a else None
            })
    
    data
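    On the original XPath question: BeautifulSoup cannot evaluate XPath expressions (that would need lxml), but `find()` accepts arbitrary attributes, so a table without id or class can still be matched via its `summary` or `width` attribute. A minimal sketch on a stand-in snippet mirroring the question's markup:

    ```python
    from bs4 import BeautifulSoup

    # Stand-in snippet mirroring the question's table markup
    html = '''
    <table width="100%" summary="Layout showing RecallTest table with 6 columns">
      <tr><td>12/31/2015</td></tr>
    </table>
    '''
    soup = BeautifulSoup(html, 'html.parser')

    # Match on plain attributes instead of id or class ...
    table = soup.find('table', attrs={'width': '100%'})
    # ... or pass a callable to match only part of the long summary text
    table = soup.find('table', summary=lambda v: v and 'RecallTest' in v)
    print(table.td.text)  # 12/31/2015
    ```

    The `data` list built in the loop above can then be turned into a dataframe with `pd.DataFrame(data)`.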