I have been trying for a while now and I am stuck. The site I am scraping has the following structure (unfortunately I only have a screenshot; somehow I can't manage to copy-paste the code).
EDIT: sorry, sure, here is one of the URLs:
I have found the div class="field field etc.". I want to store everything in "strong" or "h4" as the data frame's column names (got that part), together with the corresponding text. I was partially successful: I only lost the content of the second tag under "Project Objective", and I am totally lost with "Partners" and the text between the tags.
This is what I did:
content = soup.find_all('div', class_='field field--text_default field--body')
# For the headers:
headers = content[0].find_all(["strong", "h4"])
col_names = []
for header in headers:
    col_names.append(header.text)
# and for the content:
con = []
divs = content[0].find_all(["strong", "h4"])
for el in divs:
    con.append(el.next_sibling)
con = [el.string for el in con if el is not None]
This is a modification of @Sebastian's version.
I keep everything in one list, `data`, as pairs `(header, text)`, but I don't add to this list directly. When I find a header, I keep it in a separate variable, `header`. When I find text, I also keep it, in a separate list, `text`. Only when I find the next header do I add the previous `header, text` pair to `data`. At the end I have to add the last `header, text` pair to `data`. I also start with `header = None` so I can tell whether the first header has been found yet and avoid adding an empty `header, text` pair.
Because I keep all the text as a list, I can decide later whether to display it on one line or on separate lines (as for the `--` items in `Partners`).
I also added code for `<a>` to get the email address. I was thinking about also adding code for `<br>`.
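To look at the bookkeeping on its own, here is a minimal sketch with the HTML parsing left out. It assumes the page has already been flattened into a list of `("header", value)` and `("text", value)` items; that flat list and the name `group_pairs` are made up for illustration and are not part of the scraping code.

```python
def group_pairs(items):
    """Group a flat stream of ('header' | 'text', value) items
    into [header, text_list] pairs."""
    data = []       # finished pairs [header, text]
    header = None   # last header seen; None until the first one
    text = []       # text collected since the last header
    for kind, value in items:
        if kind == "header":
            if header is not None:   # nothing to flush before the first header
                data.append([header, text])
            header = value
            text = []
        else:
            text.append(value)
    if header is not None:           # flush the last pair
        data.append([header, text])
    return data


# hypothetical input, loosely shaped like the page discussed here
items = [
    ("text", "dropped: no header seen yet"),
    ("header", "Cost Share"),
    ("text", "$95,293"),
    ("header", "Partners"),
    ("text", "-- Panasonic Corp."),
    ("text", "-- National Renewable Energy Laboratory"),
    ("header", "Contacts"),
]
pairs = group_pairs(items)
for header, text in pairs:
    print(header, "->", text)
```

A header with nothing after it, like `Contacts` here, still yields a pair, just with an empty list. Text that appears before the first header is dropped, which matches what the full script does.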
import requests
import bs4
from bs4 import BeautifulSoup as BS

url = 'https://www.energy.gov/eere/buildings/downloads/new-iglu-high-efficiency-vacuum-insulated-panel-modular-building-system'
r = requests.get(url)
soup = BS(r.text, 'html.parser')
content = soup.find_all('div', class_='field field--text_default field--body')
#print(content)

data = []      # list for pairs `(header, text)`
header = None  # last found `header`
text = []      # all text found after last `header`

all_tags = content[0].find_all(["p", "h4"])
for tag in all_tags:
    for child in tag.children:
        if isinstance(child, bs4.element.Tag):
            if child.name == "strong":
                # store the previous `header + text`
                if header is not None:  # nothing to store before the first header
                    data.append([header, text])
                # remember the new `header` and make room for new text
                header = child.get_text().strip(": ")
                text = []
            #if child.name == "br":
            #    text.append('\n')
            if child.name == "a":
                text.append(child.get_text().strip())
        if isinstance(child, bs4.element.NavigableString):
            if child in ("Project Objective", "Project Impact", "Contacts"):
                # store the previous `header + text`
                if header is not None:  # nothing to store before the first header
                    data.append([header, text])
                # remember the new `header` and make room for new text
                header = child.strip()
                text = []
            else:
                # remember `text`
                text.append(child.strip())

# add the last `header + text`
if header is not None:  # nothing to store if no header was found
    data.append([header, text])

# --- display ---
print('len(data):', len(data), '\n')
for header, text in data:
    print('header:', header)
    print('--- text ---')
    #print(' '.join(text).strip('\n'))
    if header == 'Partners':
        print('\n'.join(text))
    else:
        print(' '.join(text))
    print('====================================')
Result:
Only the header `Contacts` is empty, because its elements fall under the headers `DOE Technology Manager` and `Lead Performer`.
len(data): 11
header: Lead Performer
--- text ---
Cold Climate Housing Research Center – Fairbanks, AK
====================================
header: Partners
--- text ---
-- Panasonic Corp. – Newark, NJ
-- Taġiuġmiullu Nunamiullu Housing Authority – Utqiagvik, AK
-- National Renewable Energy Laboratory, Golden, CO
====================================
header: DOE Total Funding
--- text ---
$375,161
====================================
header: Cost Share
--- text ---
$95,293
====================================
header: Project Term
--- text ---
July 2020 – May 2022
====================================
header: Funding Type
--- text ---
Advanced Building Construction FOA Award
====================================
header: Project Objective
--- text ---
Vacuum insulated panels (VIPs) are poised to transform the building industry by making homes more energy efficient with little additional upfront cost. However, they are currently uncommon due to their inherent fragility. As the R-value relies on the vacuum inside the panel, any damage to the panel negates the insulation value of the system. With today’s residential construction methods and fastener technology, it is nearly impossible to avoid damaging panels during assembly or over the life of the home. These issues make VIPs incompatible with current construction techniques. To overcome these issues and capitalize on the high R-value of VIPs, the project team will develop a new building system with durable assemblies that can perform in Arctic conditions. The long-term plan is to make the system a mass-market building platform that can address the need for affordable, high-efficiency housing across the nation. This starts with a proof of concept that will be built and tested at the Cold Climate Housing Research Center in Fairbanks, Alaska. Developing this concept in the country’s only Arctic state, which has the coldest temperatures and highest energy costs in the U.S., will ensure its durability and performance in other climates.
====================================
header: Project Impact
--- text ---
The energy-savings payback of this system is estimated to be eight years with applicability and potential benefit in every U.S. climate zone. For remote regions such as central Alaska, the payback would be even shorter as the cost of energy exceeds the assumed retail energy cost. Considering the building envelope alone, this system is expected to achieve a reduction in heating/cooling energy of at least 48% and an annual savings of 1,637 TBtu if implemented nationwide.
====================================
header: Contacts
--- text ---
====================================
header: DOE Technology Manager
--- text ---
Marc LaFrance, Marc.Lafrance@ee.doe.gov
====================================
header: Lead Performer
--- text ---
Bruno Grunau, Cold Climate Housing Research Center
====================================
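Since the question asks for data frame columns, one possible follow-up is to turn the pairs into a single record keyed by header. This is only a sketch: the sample `data` is shortened by hand from the output above, and the name `to_record`, the `multiline` parameter, and the choice to suffix a repeated header with a counter (`Lead Performer` appears twice) are my own assumptions. With pandas installed, `pandas.DataFrame([record])` should then give one row with these column names.

```python
def to_record(data, multiline=("Partners",)):
    """Build {header: joined text} from [header, text_list] pairs."""
    record = {}
    for header, text in data:
        # headers listed in `multiline` keep one item per line
        sep = "\n" if header in multiline else " "
        value = sep.join(text).strip()
        key = header
        n = 2
        while key in record:  # repeated header: add a counter (assumed convention)
            key = f"{header} ({n})"
            n += 1
        record[key] = value
    return record


# sample pairs, shortened by hand from the output above
data = [
    ["Lead Performer", ["Cold Climate Housing Research Center"]],
    ["Partners", ["-- Panasonic Corp.",
                  "-- National Renewable Energy Laboratory, Golden, CO"]],
    ["DOE Total Funding", ["$375,161"]],
    ["Contacts", []],
    ["Lead Performer", ["Bruno Grunau, Cold Climate Housing Research Center"]],
]
record = to_record(data)
for column, value in record.items():
    print(repr(column), "->", repr(value))
```

Keeping `text` as a list until this point is what makes the per-column choice of separator possible.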