html web-scraping beautifulsoup text-extraction data-extraction

How to extract the data in between 2 header tags?


I'm scraping a website and I want to extract the data between two headers and attach it to the first header as a key-value pair.

How can I extract the text under each header (such as h1 and h2)?

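import re
from bs4 import BeautifulSoup

# `page` is assumed to be an already fetched response (e.g. from requests.get)
soup = BeautifulSoup(page.content, 'html.parser')
items = soup.select("div.conWrap")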

htag_count = []
item_header = soup.find_all(re.compile('^h[1-6]'))
for item in item_header:
    htag_count.append({item.name:item.text})

print(htag_count)

Solution

  • This won't work if the h_ tags don't share a direct parent, but you could try looping through the sibling tags after each h_ tag [and stopping when the next h_ tag is reached].

    # import requests
    # from bs4 import BeautifulSoup
    # url = 'https://en.wikipedia.org/wiki/Chris_Yonge' [ for example ]
    # soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
    # item_header = soup.find_all(re.compile('^h[1-6]')) # should be same as
    item_header = soup.find_all([f'h{i}' for i in range(1,7)])
    
    skipTags = ['script', 'style'] # any tags you don't want text from
    hSections = []
    
    for h in item_header:
        sectionLines = []
    
        for ns in h.find_next_siblings():
            if ns in item_header: break # stop if/when next header is reached
            if ns.name in skipTags: continue # skip certain tags
    
            # [ split+join to minimize whitespace ]
            sectionLines.append(' '.join(ns.get_text(' ').split()))
    
        hSections.append({
            'header_type': h.name, 'header_text': h.get_text(' ').strip(),
            'section_text': '\n'.join([l for l in sectionLines if l])
        })    
    

    I couldn't test this properly since you didn't include an HTML snippet or a link to the site you want to scrape, but when I tried it on a Wikipedia page, hSections (after truncating and tabulating) looked like this: [image: sampop]
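As a follow-up to the key-value part of the question, here is a minimal sketch of one way to turn a list shaped like hSections into a header-to-text mapping. The sample entries and the helper name `sections_to_pairs` are made up for illustration (they are not scraped output), and collecting the values into lists is just one possible way to cope with pages that repeat the same header text.

```python
from collections import defaultdict

# Hypothetical sample in the same shape as hSections (NOT real scraped data)
sample_sections = [
    {'header_type': 'h1', 'header_text': 'Title', 'section_text': 'Intro line'},
    {'header_type': 'h2', 'header_text': 'History', 'section_text': 'First\nSecond'},
    {'header_type': 'h2', 'header_text': 'History', 'section_text': 'Repeated header'},
    {'header_type': 'h2', 'header_text': 'Empty', 'section_text': ''},
]

def sections_to_pairs(sections):
    """Map each header text to the list of section texts found under it."""
    pairs = defaultdict(list)
    for sec in sections:
        # a list per key keeps every section when a header text occurs twice
        pairs[sec['header_text']].append(sec['section_text'])
    return dict(pairs)

pairs = sections_to_pairs(sample_sections)
for header, texts in pairs.items():
    print(repr(header), '->', texts)
```

If header text is known to be unique on your page, a plain dict comprehension over `header_text` and `section_text` would do the same job with less code.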

    You can also take a look at this solution if you're interested in nesting subsections into their parent sections.
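For the nesting idea, a rough sketch (my own illustration, not the linked solution) is to walk a flat list shaped like hSections and keep a stack of the currently open headers, so that each h3 ends up under the nearest preceding h2, each h2 under the nearest preceding h1, and so on. The function name `nest_sections`, the added `level` and `children` keys, and the sample data are all assumptions made up for this example.

```python
def nest_sections(sections):
    """Nest flat header sections by heading level (h1 above h2 above h3 ...)."""
    root = {'level': 0, 'children': []}  # sentinel that sits above every real header
    stack = [root]
    for sec in sections:
        level = int(sec['header_type'][1])  # 'h2' -> 2
        node = {**sec, 'level': level, 'children': []}
        # pop until the top of the stack is a shallower header: that is the parent
        while stack[-1]['level'] >= level:
            stack.pop()
        stack[-1]['children'].append(node)
        stack.append(node)
    return root['children']

# Hypothetical flat input in the same shape as hSections (NOT real scraped data)
flat_sections = [
    {'header_type': 'h1', 'header_text': 'Title', 'section_text': 'Intro'},
    {'header_type': 'h2', 'header_text': 'Early life', 'section_text': 'Born somewhere'},
    {'header_type': 'h3', 'header_text': 'Education', 'section_text': 'Went to school'},
    {'header_type': 'h2', 'header_text': 'Career', 'section_text': 'Worked'},
    {'header_type': 'h1', 'header_text': 'Appendix', 'section_text': ''},
]

tree = nest_sections(flat_sections)

def show(nodes, depth=0):
    """Print the nested headers with indentation per nesting depth."""
    for n in nodes:
        print('  ' * depth + f"{n['header_type']}: {n['header_text']}")
        show(n['children'], depth + 1)

show(tree)
```

This only looks at the header level, so a page that skips levels (an h3 directly under an h1, say) simply gets the h3 attached to that h1.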