This page contains the table I want to scrape with BeautifulSoup:
Flavors of Cacao - Chocolate Database
The table is located inside a div with id spryregion1. However, I couldn't locate it by that id, so instead I located it by the width of the table and then found all the tr elements.
The column titles are enclosed in th elements, and each row's entries are in td elements. I have tried a few ways but couldn't scrape all the rows and put them into a CSV file.
Could someone give me some help/advice? Thanks!
The table you are looking for is not contained in the HTML for the page you are requesting. The page uses JavaScript to request another HTML document containing the table, which it then wraps in the <div> that you were looking for.
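One way to confirm this kind of problem is to check whether the container exists while the table inside it is missing. The following is only a sketch: the HTML string is a made-up stand-in for what requests might receive from a JavaScript-driven page, not the real markup of the site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the static HTML of a page whose table is
# filled in later by JavaScript: the container div is there, the table is not.
static_html = """
<html><body>
  <div id="spryregion1"></div>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")
region = soup.find("div", id="spryregion1")

# The div may be present even though it contains no table yet.
has_div = region is not None
has_table = has_div and region.find("table") is not None

print("div found:", has_div)
print("table found:", has_table)
```

If the table shows up in the browser but a check like this reports no table, the content is most likely being loaded by a separate request.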
To get the table, you can use your browser's developer tools (the Network tab) to spot the URL that the page is requesting and use this to get the page you need:
import requests
from bs4 import BeautifulSoup
import csv

# Request the document that actually contains the table
r = requests.get("http://flavorsofcacao.com/database_w_REF.html")
soup = BeautifulSoup(r.content, "html.parser")

with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    # Header row: the <th> cells of the first <tr>
    csv_output.writerow([th.get_text(strip=True) for th in soup.table.tr.find_all('th')])
    # Data rows: every <tr> after the header
    for tr in soup.table.find_all("tr")[1:]:
        csv_output.writerow([td.get_text(strip=True) for td in tr.find_all('td')])
From there, you can first extract the header row by searching for the <th> entries and then iterate over all the rows. The data can then be written to a CSV file using Python's csv library.
This gives you an output.csv file starting:
Company (Maker-if known),Specific Bean Origin or Bar Name,REF,Review Date,Cocoa Percent,Company Location,Rating,Bean Type,Broad Bean Origin
A. Morin,Bolivia,797,2012,70%,France,3.5,,Bolivia
A. Morin,Peru,797,2012,63%,France,3.75,,Peru
A. Morin,Brazil,1011,2013,70%,France,3.25,,Brazil
Tested using Python 3.6.3
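If you want to try the extraction logic without hitting the site, the same steps can be wrapped in a function and run against an inline HTML string. This is a minimal sketch: the name table_to_csv and the sample table are hypothetical and only mimic the shape of the real data.

```python
import csv
import io

from bs4 import BeautifulSoup


def table_to_csv(html, out_file):
    """Write the first table in `html` to `out_file` as CSV.

    Returns the number of data rows written (header excluded).
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> found in the HTML")

    rows = table.find_all("tr")
    if not rows:
        raise ValueError("the table has no <tr> rows")

    writer = csv.writer(out_file)
    # Header row: the <th> cells of the first <tr>
    writer.writerow([th.get_text(strip=True) for th in rows[0].find_all("th")])

    count = 0
    for tr in rows[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows that contain no <td> cells
            writer.writerow(cells)
            count += 1
    return count


# Hypothetical sample shaped like the real table, with an empty cell
sample_html = """
<table>
  <tr><th>Company</th><th>Rating</th><th>Bean Type</th></tr>
  <tr><td>A. Morin</td><td>3.5</td><td></td></tr>
  <tr><td>A. Morin</td><td>3.75</td><td></td></tr>
</table>
"""

buffer = io.StringIO()
row_count = table_to_csv(sample_html, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a file opened with newline='' and encoding='utf-8' instead of the StringIO buffer, and pass r.text or r.content from the request as the html argument.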