Iterate over all elements in html and replace content with Beautifulsoup

In my database I am storing HTML coming from a custom CMS's WYSIWYG editor. The contents are in English and I'd like to use Beautifulsoup to iterate over every single element, translate its contents to German (using another class, Translator) and replace the value of the current element with the translated text.

So far, I have been able to come up with specific selectors for p, a, pre in combination with the .findAll function of Beautifulsoup, however I have googled and it is not clear to me how I can simply go through all elements and replace their content on the fly, instead of having to filter based on a specific type.

A very basic example of HTML produced by the editor covering all different kinds of types:

<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<p>Normal text</p>
<p><strong>Bold text</strong></p>
<p><em>Italic text </em></p>
<p><br></p>
<blockquote>Quote</blockquote>
<p>text after quote</p>
<p><br></p>
<p><br></p>
<pre class="code-syntax" spellcheck="false">code</pre>
<p><br></p>
<p>text after code</p>
<p><br></p>
<p><a href="https://google.com/" target="_blank">This is a search engine</a></p>
<p><br></p>
<p><img src="https://via.placeholder.com/350x150"></p>

The bs4 documentation points me to a replace_with function, which would be ideal if I could only select each element after each other, not having to specifically select something.

Pointers would be welcome 😊

Solution

You can basically do this to iterate over every element :

html="""
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<p>Normal text</p>
<p><strong>Bold text</strong></p>
<p><em>Italic text </em></p>
<p><br></p>
<blockquote>Quote</blockquote>
<p>text after quote</p>
<p><br></p>
<p><br></p>
<pre class="code-syntax" spellcheck="false">code</pre>
<p><br></p>
<p>text after code</p>
<p><br></p>
<p><a href="https://google.com/" target="_blank">This is a search engine</a></p>
<p><br></p>
<p><img src="https://via.placeholder.com/350x150"></p>
"""

from bs4 import BeautifulSoup

soup=BeautifulSoup(html,"lxml")
for x in soup.findAll():
    print(x.text)
    # You can try this as well
    print(x.find(text=True,recursive=False))
    # I think this will return result as you expect.

Output :

Heading 1
Heading 2
Heading 3
Normal text
Bold text
Italic text 

Quote
text after quote


code

text after code

This is a search engine



Heading 1
Heading 2
Heading 3
Normal text
Bold text
Italic text 

Quote
text after quote


code

text after code

This is a search engine



Heading 1
Heading 2
Heading 3
Normal text
Bold text
Bold text
Italic text 
Italic text 


Quote
text after quote




code


text after code


This is a search engine
This is a search engine

And I believe you have translator function and you know how to replace that also.