Tags: python, beautifulsoup, tags, screen-scraping

Using BeautifulSoup to get tags and text


I have been trying for a while now and I am stuck. The site I am scraping has the following structure (unfortunately I only have a screenshot; somehow I can't manage to copy-paste the code...)

EDIT: sorry, sure, here is one of the URLs:

https://www.energy.gov/eere/buildings/downloads/new-iglu-high-efficiency-vacuum-insulated-panel-modular-building-system

[screenshot of the page's HTML structure]

I have found the div with class="field field--text_default field--body". I want to store everything inside the <strong> or <h4> tags as the data frame's column names (I got that part) together with the text that belongs to each of them. I was partially successful: I only lost the content of the second <p> tag under "Project Objective", and I am totally lost with "Partners", where the text sits between <br> tags. That is what I did:

content = soup.find_all('div', class_='field field--text_default field--body')

# For the headers:
headers = content[0].find_all(["strong","h4"])
col_names = []
for header in headers:
    col_names.append(header.text)

# and for the content:
con = []
divs = content[0].find_all(["strong", "h4"])
for el in divs:
    con.append(el.next_sibling)
con = [el.string for el in con if el is not None]

Solution

  • It is a modification of @Sebastian's version.

    I keep everything in one list, data, as pairs (header, text), but I don't add them to this list right away.

    When I find a header, I keep it in a separate variable header. When I find text, I also keep it, in a separate list text. Only when I find the next header do I append the previous header, text to data, and at the end I have to append the last header, text as well. I also use header = None to recognize whether the first header has been found yet, so that no empty header, text pairs are added.

    Because I keep all the text as a list, I can decide later whether to display it on one line or on separate lines (as for -- in Partners).

    I also added code for <a> to get the email address. I was thinking about adding code for <br> as well.
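
    For context, the reason the original next_sibling approach loses text is that next_sibling returns only the single node directly after a tag, so a second <p> or anything following a <br> is skipped. A minimal sketch with a made-up fragment (not the real page markup):

    from bs4 import BeautifulSoup

    # made-up fragment in the same shape as the page: a <strong> header
    # followed by text that is split into pieces by <br> tags
    snippet = '<p><strong>Partners:</strong> A Corp.<br/>B Inc.<br/>C Lab</p>'
    strong = BeautifulSoup(snippet, 'html.parser').find('strong')

    print(repr(strong.next_sibling))   # ' A Corp.' - only the first node
    print(list(strong.next_siblings))  # every following node, <br> tags included

    The full code: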

    import requests
    import bs4
    from bs4 import BeautifulSoup as BS
    
    url = 'https://www.energy.gov/eere/buildings/downloads/new-iglu-high-efficiency-vacuum-insulated-panel-modular-building-system'
    
    r = requests.get(url)
    
    soup = BS(r.text, 'html.parser')
    
    content = soup.find_all('div', class_='field field--text_default field--body')
    #print(content)
    
    data = []   # list for pairs `(header, text)`
    
    header = None  # last found `header`
    text = []      # all text found after last `header`
    
    
    all_tags = content[0].find_all(["p","h4"])
    
    for tag in all_tags:
    
        for child in tag.children:
            if isinstance(child, bs4.element.Tag):
                if child.name == "strong":
                    # put previous `header + text`
                    if header is not None:  # skip before the first header
                        data.append( [header, text] )
    
                    # remember new `header` and make room for new text
                    header = child.get_text().strip(": ")
                    text = []
    
                #if child.name == "br":
                #    text.append('\n')

                if child.name == "a":
                    text.append(child.get_text().strip())
    
            if isinstance(child, bs4.element.NavigableString):
                if child in ("Project Objective", "Project Impact", "Contacts"):
                    # put previous `header + text`
                    if header is not None:  # skip before the first header
                        data.append( [header, text] )
    
                    # remember new `header` and make room for new text
                    header = child.strip()
                    text = []
                else:
                    # remember `text`
                    text.append(child.strip())
    
    # add last `header + text`
    if header is not None:  # skip before the first header
        data.append( [header, text] )
    
    # --- display ---
    
    print('len(data):', len(data), '\n')
    
    for header, text in data:
        print('header:', header)
        print('--- text ---')
        #print(' '.join(text).strip('\n'))
        if header == 'Partners':
            print('\n'.join(text))
        else:        
            print(' '.join(text))
        print('====================================')
    

    Result:

    Only the Contacts header is empty, because its content sits under the DOE Technology Manager and Lead Performer headers.

    len(data): 11 
    
    header: Lead Performer
    --- text ---
    Cold Climate Housing Research Center – Fairbanks, AK
    ====================================
    header: Partners
    --- text ---
    -- Panasonic Corp. – Newark, NJ
    -- Taġiuġmiullu Nunamiullu Housing Authority – Utqiagvik, AK
    -- National Renewable Energy Laboratory, Golden, CO
    ====================================
    header: DOE Total Funding
    --- text ---
    $375,161
    ====================================
    header: Cost Share
    --- text ---
    $95,293
    ====================================
    header: Project Term
    --- text ---
    July 2020 – May 2022
    ====================================
    header: Funding Type
    --- text ---
    Advanced Building Construction FOA Award
    ====================================
    header: Project Objective
    --- text ---
    Vacuum insulated panels (VIPs) are poised to transform the building industry by making homes more energy efficient with little additional upfront cost. However, they are currently uncommon due to their inherent fragility. As the R-value relies on the vacuum inside the panel, any damage to the panel negates the insulation value of the system. With today’s residential construction methods and fastener technology, it is nearly impossible to avoid damaging panels during assembly or over the life of the home. These issues make VIPs incompatible with current construction techniques. To overcome these issues and capitalize on the high R-value of VIPs, the project team will develop a new building system with durable assemblies that can perform in Arctic conditions. The long-term plan is to make the system a mass-market building platform that can address the need for affordable, high-efficiency housing across the nation. This starts with a proof of concept that will be built and tested at the Cold Climate Housing Research Center in Fairbanks, Alaska. Developing this concept in the country’s only Arctic state, which has the coldest temperatures and highest energy costs in the U.S., will ensure its durability and performance in other climates.
    ====================================
    header: Project Impact
    --- text ---
    The energy-savings payback of this system is estimated to be eight years with applicability and potential benefit in every U.S. climate zone. For remote regions such as central Alaska, the payback would be even shorter as the cost of energy exceeds the assumed retail energy cost. Considering the building envelope alone, this system is expected to achieve a reduction in heating/cooling energy of at least 48% and an annual savings of 1,637 TBtu if implemented nationwide.
    ====================================
    header: Contacts
    --- text ---
    
    ====================================
    header: DOE Technology Manager
    --- text ---
    Marc LaFrance, [email protected] 
    ====================================
    header: Lead Performer
    --- text ---
    Bruno Grunau, Cold Climate Housing Research Center
    ====================================
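
    Since the original goal was a data frame with these headers as column names, here is a minimal sketch (assuming pandas is installed) that turns the collected (header, text) pairs into a one-row DataFrame. The numbering of repeated headers (the second Lead Performer) is only an illustrative choice:

    import pandas as pd

    # hypothetical follow-up: build a one-row DataFrame from `data`;
    # repeated headers (e.g. the second "Lead Performer") get a numeric
    # suffix so the column names stay unique
    row = {}
    for header, text in data:
        key = header
        n = 2
        while key in row:
            key = f'{header} ({n})'
            n += 1
        row[key] = ' '.join(text).strip()

    df = pd.DataFrame([row])
    print(df.columns.tolist())
    print(df.iloc[0])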