Tags: python, web-scraping, beautifulsoup

How to extract nested text while keeping line breaks?


I want to extract text from a deeply nested website that has no obvious patterns or classes I can rely on.

That's why I need to write logic that is fairly generic and works in multiple scenarios, and that is where I need some support.

If we have for example:

<div><span>Hello<br>World</span>, how are you doing?</div>
<span><span>This<br><br><br>is difficult</span>, at<br>least for me.</span>

... I would like to extract Hello<br>World, how are you doing? as the first element, and This<br><br><br>is difficult, at<br>least for me. as the second.

So it should keep the text (and line breaks) while grouping the elements together.

I tried multiple approaches, the latest:

import re

from bs4 import BeautifulSoup, NavigableString

def is_visible_text(element):
    if isinstance(element, NavigableString):
        # Remove non-visible characters using regex
        text = re.sub(r'[\u200B-\u200D\uFEFF]', '', element)
        return text.strip() != ''
    return False

def extract_deepest_text_elements(element):
    if isinstance(element, NavigableString) and is_visible_text(element):
        return [element]
    if element.name in ['br']:
        return [element]

    # List to hold extracted text and <br> elements
    extracted_elements = []

    # Process child elements first
    for child in element.contents:
        extracted_elements.extend(extract_deepest_text_elements(child))

    return extracted_elements

def refine_content(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as file:
        content = file.read()

    soup = BeautifulSoup(content, 'html.parser')
    new_body_content = soup.new_tag('div')

    # Start with the highest-order elements (div, span, p)
    elements = soup.find_all(['div', 'span', 'p'])
    for elem in elements:
        while elem:
            deepest_elements = extract_deepest_text_elements(elem)
            if deepest_elements:
                for element in deepest_elements:
                    new_body_content.append(element)
                new_body_content.append(soup.new_tag('br'))  # Ensure BRs after text
            # Move up to the parent element
            elem = elem.parent if elem.parent and elem.parent.name != 'body' else None

    new_soup = BeautifulSoup('<html><body></body></html>', 'html.parser')
    new_soup.body.append(new_body_content)

    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(new_soup.prettify())

... but it does not work as intended. Currently the elements appear multiple times in the output.

I would highly appreciate your take on that challenge.


Solution

  • Without knowing more about the page, I would take a relatively simple approach: substitute a placeholder for each <br>, extract the text, and then swap the placeholder back:

    from bs4 import BeautifulSoup
    
    html = '''<div><span>Hello<br>World</span>, how are you doing?</div>
    <span><span>This<br><br><br>is difficult</span>, at<br>least for me.</span>'''
    
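    soup = BeautifulSoup(html, 'html.parser')
    
    # Swap every <br> for a placeholder that survives get_text()
    for br in soup('br'):
        br.replace_with('^^')
    
    # Rebuild a document from the plain text, turning the placeholder back into <br>
    new_soup = BeautifulSoup(f"<html><body>{soup.get_text().replace('^^', '<br>')}</body></html>", 'html.parser')
    print(new_soup)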
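
The snippet above merges everything into one body, while the question asks for one entry per top-level element. A minimal sketch of that grouping follows, assuming the groups of interest are the direct children of the parsed document. The names `BR_TOKEN`, `text_with_breaks` and `top_level_blocks` are hypothetical helpers, not part of BeautifulSoup, and the private-use character used as placeholder is an assumption that it never occurs in the page text. The text is also HTML-escaped before the `<br>` tags are restored, so that a literal `<` or `&` in the page is not mistaken for markup later.

```python
from html import escape

from bs4 import BeautifulSoup

# Hypothetical placeholder; assumed never to appear in the real page text.
BR_TOKEN = "\ue000"


def text_with_breaks(element):
    """Return the text of `element` with placeholders turned back into <br>."""
    # Escape first so literal '<' or '&' in the text stays text,
    # then restore the line breaks as real <br> markup.
    return escape(element.get_text(), quote=False).replace(BR_TOKEN, "<br>")


def top_level_blocks(html):
    """Return one string per top-level tag, keeping <br> as line breaks."""
    soup = BeautifulSoup(html, "html.parser")
    # The parser normalises <br>, <br/> and <br /> to the same tag.
    for br in soup.find_all("br"):
        br.replace_with(BR_TOKEN)
    # recursive=False limits the search to direct children of the document,
    # so nested tags are never reported as separate groups.
    return [text_with_breaks(tag) for tag in soup.find_all(True, recursive=False)]


html = '''<div><span>Hello<br>World</span>, how are you doing?</div>
<span><span>This<br><br><br>is difficult</span>, at<br>least for me.</span>'''

for block in top_level_blocks(html):
    print(block)
```

Stray text that sits directly at the top level, outside any tag, is skipped by this sketch. On a full page the grouping would probably have to start from `soup.body` or another container rather than from the document root.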
    

    You could also replace the tags in the raw string before handing it to BeautifulSoup, but keep in mind that you then have to handle every spelling of the tag yourself (<br>, <br/>, <br />):

    ...
    soup = BeautifulSoup(html.replace('<br>', '^^'), 'html.parser')
    new_soup = BeautifulSoup(f"<html><body>{soup.get_text().replace('^^', '<br>')}</body></html>", 'html.parser')
    ...
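
If the string-level replacement is preferred, a regular expression is one way to cover the different spellings of the tag in a single pass. This is only a sketch: `BR_PATTERN` and `mask_breaks` are hypothetical names, and the pattern is an assumption about which variants occur (it covers `<br>`, `<br/>`, `<br />`, upper case and tags with attributes, but not the malformed `</br>`).

```python
import re

from bs4 import BeautifulSoup

# \b keeps the pattern from matching other tags that merely start with "br".
BR_PATTERN = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def mask_breaks(html, token="^^"):
    """Replace every <br> variant in the raw HTML string with `token`."""
    return BR_PATTERN.sub(token, html)


sample = 'a<br>b<br/>c<BR />d<br class="x">e'
masked = mask_breaks(sample)
print(masked)
print(BeautifulSoup(masked, "html.parser").get_text())
```

Letting the parser find the tags, as in the first variant, avoids maintaining such a pattern at all.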
    

    Both variants produce:
    
    <html><body>Hello<br/>World, how are you doing?
    This<br/><br/><br/>is difficult, at<br/>least for me.</body></html>
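
As for the repeated output in the question: going only by the code shown there, one likely contributor is that `find_all(['div', 'span', 'p'])` returns nested tags as well as their ancestors, so the same text is reachable from several matches (and the loop additionally walks up to each parent). The sketch below only demonstrates that overlap on the sample HTML; it is not a full diagnosis of the original script.

```python
from bs4 import BeautifulSoup

html = '''<div><span>Hello<br>World</span>, how are you doing?</div>
<span><span>This<br><br><br>is difficult</span>, at<br>least for me.</span>'''

soup = BeautifulSoup(html, "html.parser")

# find_all searches the whole tree, so inner tags are matched
# in addition to the outer tags that already contain them.
matches = soup.find_all(["div", "span", "p"])
for match in matches:
    print(match.name, repr(match.get_text()))

# The word "Hello" is reachable from more than one matched tag.
hits = [match for match in matches if "Hello" in match.get_text()]
print(len(matches), len(hits))
```

Restricting the search to one level of the tree, for example with `recursive=False`, is one way to avoid visiting the same text twice.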