I want to extract text from a deeply nested website that has no obvious pattern or classes I can use.
That's why I need to write logic that is fairly "generic" and works in multiple scenarios. That's where I need some support.
If we have, for example:
<div><span>Hello<br>World</span>, how are you doing?</div>
<span><span>This<br><br><br>is difficult</span>, at<br>least for me.</span>
... I would like to extract Hello<br>World, how are you doing?
as the first element, and then This<br><br><br>is difficult, at<br>least for me.
So it should keep the text (and line breaks) while grouping the elements together.
I tried multiple approaches, the latest:
import re

from bs4 import BeautifulSoup, NavigableString


def is_visible_text(element):
    if isinstance(element, NavigableString):
        # Remove non-visible characters using regex
        text = re.sub(r'[\u200B-\u200D\uFEFF]', '', element)
        return text.strip() != ''
    return False


def extract_deepest_text_elements(element):
    if isinstance(element, NavigableString) and is_visible_text(element):
        return [element]
    if element.name in ['br']:
        return [element]
    # List to hold extracted text and <br> elements
    extracted_elements = []
    # Process child elements first
    for child in element.contents:
        extracted_elements.extend(extract_deepest_text_elements(child))
    return extracted_elements


def refine_content(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as file:
        content = file.read()
    soup = BeautifulSoup(content, 'html.parser')
    new_body_content = soup.new_tag('div')
    # Start with the highest-order elements (div, span, p)
    elements = soup.find_all(['div', 'span', 'p'])
    for elem in elements:
        while elem:
            deepest_elements = extract_deepest_text_elements(elem)
            if deepest_elements:
                for element in deepest_elements:
                    new_body_content.append(element)
                new_body_content.append(soup.new_tag('br'))  # Ensure BRs after text
            # Move up to the parent element
            elem = elem.parent if elem.parent and elem.parent.name != 'body' else None
    new_soup = BeautifulSoup('<html><body></body></html>', 'html.parser')
    new_soup.body.append(new_body_content)
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(new_soup.prettify())
... it does not work as intended. Currently the elements appear multiple times in the output.
I would highly appreciate your take on that challenge.
Without more detailed knowledge of the resource, I would take a relatively simple approach that replaces each <br> with a placeholder before extracting the text:
from bs4 import BeautifulSoup

html = '''<div><span>Hello<br>World</span>, how are you doing?</div>
<span><span>This<br><br><br>is difficult</span>, at<br>least for me.</span>'''
soup = BeautifulSoup(html, 'html.parser')

# Swap every <br> for a placeholder so that get_text() does not drop it
for br in soup('br'):
    br.replace_with('^^')

# Turn the placeholders back into <br> and parse the result as a new document
new_soup = BeautifulSoup(f"<html><body>{soup.get_text().replace('^^', '<br>')}</body></html>", 'html.parser')
new_soup
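If each top-level element should come back as its own item, as the question asks, the same placeholder substitution can be applied per element instead of to the whole document. This is only a minimal sketch: it assumes the groups are the direct children of the parsed fragment and that the placeholder `^^` never occurs in the real text. The helper name `extract_groups` is made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def extract_groups(html, placeholder='^^'):
    """Return one string per top-level node, keeping <br> as literal markup."""
    soup = BeautifulSoup(html, 'html.parser')

    # Swap every <br> for the placeholder so get_text() keeps its position
    for br in soup('br'):
        br.replace_with(placeholder)

    groups = []
    for child in soup.contents:
        if isinstance(child, Tag):
            text = child.get_text()
        elif type(child) is NavigableString:
            # Plain text sitting directly at the top level
            text = str(child)
        else:
            # Skip comments, doctypes and similar nodes
            continue
        text = text.strip()
        if text:
            groups.append(text.replace(placeholder, '<br>'))
    return groups


html = '''<div><span>Hello<br>World</span>, how are you doing?</div>
<span><span>This<br><br><br>is difficult</span>, at<br>least for me.</span>'''

groups = extract_groups(html)
for group in groups:
    print(group)
```

Because only the direct children are visited, nested elements are never processed a second time, which should avoid the repeated output described in the question. For a full page you would probably iterate over `soup.body.contents` or some other container instead.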
You could also replace the tags before converting the markup into a BeautifulSoup object, but keep in mind that you then have to deal with all forms of the tag, such as <br> and <br/>:
...
soup = BeautifulSoup(html.replace('<br>', '^^'), 'html.parser')
new_soup = BeautifulSoup(f"<html><body>{soup.get_text().replace('^^', '<br>')}</body></html>", 'html.parser')
...
<html><body>Hello<br/>World, how are you doing?
This<br/><br/><br/>is difficult, at<br/>least for me.</body></html>
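The plain `str.replace` variant only matches the literal `<br>`. One way to cover `<br/>`, `<br />` and upper-case spellings as well is a small regular expression. This is a sketch rather than a complete solution: the pattern and the helper name `text_with_breaks` are my own assumptions, and a `<br>` carrying attributes would not be matched.

```python
import re

from bs4 import BeautifulSoup

# Matches <br>, <br/>, <br /> in any letter case (no attributes)
BR_PATTERN = re.compile(r'<br\s*/?>', re.IGNORECASE)


def text_with_breaks(html, placeholder='^^'):
    """Strip all tags but keep line breaks as literal <br> markup."""
    # Replace the break tags in the raw string before parsing
    soup = BeautifulSoup(BR_PATTERN.sub(placeholder, html), 'html.parser')
    return soup.get_text().replace(placeholder, '<br>')


result = text_with_breaks('<p>a<br>b<br/>c<BR />d</p>')
print(result)
```

If the markup is less predictable than that, replacing the `br` tags after parsing, as in the first snippet, is probably the safer route, since the parser then normalises the different spellings for you.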