How to extract nested text while keeping line breaks?

I want to extract text from a deeply nested website that has no obvious pattern or classes I can rely on.

That's why I need to write fairly generic logic that works across multiple scenarios, and that's where I need some support.

If we have for example:

<div><span>Hello<br>World</span>, how are you doing?</div>
<span><span>This<br><br><br>is difficult</span>, at<br>least for me.</span>

... I would like to extract "Hello World, how are you doing?" as the first element, and "This is difficult, at least for me." as the second.

So it should keep the text (and line breaks) while grouping the elements together.

I tried multiple approaches, the latest:

import re

from bs4 import BeautifulSoup, NavigableString

def is_visible_text(element):
    if isinstance(element, NavigableString):
        # Remove non-visible characters using regex
        text = re.sub(r'[\u200B-\u200D\uFEFF]', '', element)
        return text.strip() != ''
    return False

def extract_deepest_text_elements(element):
    if isinstance(element, NavigableString) and is_visible_text(element):
        return [element]
    if element.name in ['br']:
        return [element]

    # List to hold extracted text and <br> elements
    extracted_elements = []

    # Process child elements first
    for child in element.contents:
        extracted_elements.extend(extract_deepest_text_elements(child))

    return extracted_elements

def refine_content(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as file:
        content = file.read()

    soup = BeautifulSoup(content, 'html.parser')
    new_body_content = soup.new_tag('div')

    # Start with the highest-order elements (div, span, p)
    elements = soup.find_all(['div', 'span', 'p'])
    for elem in elements:
        while elem:
            deepest_elements = extract_deepest_text_elements(elem)
            if deepest_elements:
                for element in deepest_elements:
                    new_body_content.append(element)
                new_body_content.append(soup.new_tag('br'))  # Ensure BRs after text
            # Move up to the parent element
            elem = elem.parent if elem.parent and elem.parent.name != 'body' else None

    new_soup = BeautifulSoup('<html><body></body></html>', 'html.parser')
    new_soup.body.append(new_body_content)

    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(new_soup.prettify())

... it does not work as intended. Currently the elements appear multiple times in the output.

I would highly appreciate your take on that challenge.

Solution

Without more detailed knowledge of the page, I would take a relatively simple approach: substitute each <br> with a placeholder, extract the text, then turn the placeholders back into <br>:

from bs4 import BeautifulSoup

html = '''<div><span>Hello<br>World</span>, how are you doing?</div>
<span><span>This<br><br><br>is difficult</span>, at<br>least for me.</span>'''

soup = BeautifulSoup(html,'html.parser')

for br in soup('br'):
    br.replace_with('^^')

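# Pass the parser explicitly to avoid the "no parser was explicitly specified" warning
new_soup = BeautifulSoup(f"<html><body>{soup.get_text().replace('^^','<br>')}</body></html>", 'html.parser')
print(new_soup)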

You could also replace the tags before converting the HTML into a BeautifulSoup object, but keep in mind that you then have to deal with all forms of the tag, such as <br> and <br/>:

...
soup = BeautifulSoup(html.replace('<br>','^^'),'html.parser')
new_soup = BeautifulSoup(f"<html><body>{soup.get_text().replace('^^','<br>')}</body></html>", 'html.parser')
...
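
To cover the different spellings of the tag in the string-based variant, a regular expression could stand in for the plain `str.replace`. This is only a sketch: the helper name `text_with_breaks`, the constant names, and the pattern itself are assumptions for illustration, and it still relies on `^^` never occurring in the page text.

```python
import re

from bs4 import BeautifulSoup

# Placeholder assumed not to occur anywhere in the real page text.
PLACEHOLDER = '^^'

# Matches <br>, <br/>, <br />, uppercase variants, and tags with attributes.
BR_PATTERN = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

def text_with_breaks(html):
    """Return the text of `html` with every line break kept as a literal <br>."""
    marked = BR_PATTERN.sub(PLACEHOLDER, html)
    text = BeautifulSoup(marked, 'html.parser').get_text()
    return text.replace(PLACEHOLDER, '<br>')

print(text_with_breaks('<span>Hello<br/>World<BR />again<br class="x">end</span>'))
# prints: Hello<br>World<br>again<br>end
```

Replacing the tags on the parsed tree, as in the first snippet, avoids this problem entirely because the parser normalizes every variant to a single `br` element.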

Both variants produce the following output:

<html><body>Hello<br/>World, how are you doing?
This<br/><br/><br/>is difficult, at<br/>least for me.</body></html>
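
If the groups are needed as separate items rather than one block of text, one possible variation is to collect the text of each top-level element on its own. The sketch below assumes that every group corresponds to a direct child of the parsed fragment (for a full page you would probably iterate over `soup.body` instead) and represents each line break as `\n`; the function name `extract_groups` is hypothetical.

```python
from bs4 import BeautifulSoup, Tag

html = '''<div><span>Hello<br>World</span>, how are you doing?</div>
<span><span>This<br><br><br>is difficult</span>, at<br>least for me.</span>'''

def extract_groups(html):
    """Return one string per top-level element, with <br> turned into newlines."""
    soup = BeautifulSoup(html, 'html.parser')

    # Turn every <br> into a newline so it survives get_text().
    for br in soup.find_all('br'):
        br.replace_with('\n')

    groups = []
    for child in soup.children:
        # Skip whitespace-only strings between the top-level elements.
        if isinstance(child, Tag):
            text = child.get_text()
            if text.strip():
                groups.append(text)
    return groups

groups = extract_groups(html)
for group in groups:
    print(repr(group))
# prints:
# 'Hello\nWorld, how are you doing?'
# 'This\n\n\nis difficult, at\nleast for me.'
```

Because only the direct children are visited, nested elements are never processed a second time, which is what caused the repeated output in the original attempt (`find_all` also returns the inner spans, and the loop then walks up through their parents again).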