I store HTML produced by a custom CMS's WYSIWYG editor in my database. The content is in English, and I'd like to use BeautifulSoup to iterate over every element, translate its text to German (using another class, Translator), and replace the element's text with the translation.
So far I have come up with specific selectors for p, a and pre in combination with BeautifulSoup's .findAll function. However, after searching, it is still not clear to me how I can simply walk through all elements and replace their content on the fly, instead of filtering by a specific tag type.
A very basic example of the HTML produced by the editor, covering the different element types:
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<p>Normal text</p>
<p><strong>Bold text</strong></p>
<p><em>Italic text </em></p>
<p><br></p>
<blockquote>Quote</blockquote>
<p>text after quote</p>
<p><br></p>
<p><br></p>
<pre class="code-syntax" spellcheck="false">code</pre>
<p><br></p>
<p>text after code</p>
<p><br></p>
<p><a href="https://google.com/" target="_blank">This is a search engine</a></p>
<p><br></p>
<p><img src="https://via.placeholder.com/350x150"></p>
The bs4 documentation points me to the replace_with function, which would be ideal if I could simply visit each element in turn instead of having to select specific tags.
Pointers would be welcome 😊
You can iterate over every element like this:
html="""
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<p>Normal text</p>
<p><strong>Bold text</strong></p>
<p><em>Italic text </em></p>
<p><br></p>
<blockquote>Quote</blockquote>
<p>text after quote</p>
<p><br></p>
<p><br></p>
<pre class="code-syntax" spellcheck="false">code</pre>
<p><br></p>
<p>text after code</p>
<p><br></p>
<p><a href="https://google.com/" target="_blank">This is a search engine</a></p>
<p><br></p>
<p><img src="https://via.placeholder.com/350x150"></p>
"""
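from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
# find_all() with no arguments returns every tag in the document
# (findAll is the older alias for the same method)
for x in soup.find_all():
    # All text inside the element, including the text of nested tags
    print(x.text)
    # You can try this as well: only the element's own direct text
    # (None if the element has no direct text node)
    print(x.find(string=True, recursive=False))
    # I think this will return the result you expect.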
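Output of print(x.text) (the text repeats because lxml wraps the fragment in html and body tags, and nested tags such as strong inside p are visited as well):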
Heading 1
Heading 2
Heading 3
Normal text
Bold text
Italic text
Quote
text after quote
code
text after code
This is a search engine
Heading 1
Heading 2
Heading 3
Normal text
Bold text
Italic text
Quote
text after quote
code
text after code
This is a search engine
Heading 1
Heading 2
Heading 3
Normal text
Bold text
Bold text
Italic text
Italic text
Quote
text after quote
code
text after code
This is a search engine
This is a search engine
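To illustrate why nested text shows up more than once, here is a minimal sketch on a single paragraph. It assumes the built-in "html.parser" backend instead of lxml, so no html/body wrapper tags are added; the variable names are made up for the illustration.

```python
from bs4 import BeautifulSoup

fragment = "<p><strong>Bold text</strong></p>"
soup = BeautifulSoup(fragment, "html.parser")

# find_all() visits the outer <p> and the nested <strong>;
# get_text() on each returns the same inner text, hence the repeats.
visited = [(tag.name, tag.get_text()) for tag in soup.find_all()]

# Asking only for direct text nodes avoids the repeat:
# <p> has no text of its own, so find() returns None for it.
direct = [tag.find(string=True, recursive=False) for tag in soup.find_all()]

print(visited)
print(direct)
```

This suggests that working on text nodes, rather than on tags, is the safer way to avoid translating the same text twice.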
I assume you already have a translator function; you can then replace each element's text with its result, for example with replace_with.
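A rough sketch of the replacement step, under a few assumptions: translate_html and its skip_tags parameter are hypothetical names, str.upper stands in for the asker's Translator, and skipping pre (code) is a guess at what is wanted. It walks over text nodes rather than tags, so attributes and nested markup stay intact and each piece of text is handled once.

```python
from bs4 import BeautifulSoup, Comment


def translate_html(html, translate_fn, skip_tags=("pre", "script", "style")):
    """Replace every non-empty text node with translate_fn(text).

    translate_fn is a stand-in for the real translator (hypothetical);
    skip_tags lists parent tags whose text is left untouched (assumption).
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all(string=True) returns a list of all text nodes, so it is
    # safe to modify the tree while looping over it.
    for node in soup.find_all(string=True):
        if isinstance(node, Comment):
            continue  # leave HTML comments alone
        if not node.strip():
            continue  # skip whitespace-only nodes
        if node.parent.name in skip_tags:
            continue  # e.g. do not translate code inside <pre>
        node.replace_with(translate_fn(str(node)))
    return str(soup)


sample = (
    "<p><strong>Bold text</strong></p>"
    '<pre class="code-syntax">code</pre>'
    '<p><a href="https://google.com/">This is a search engine</a></p>'
)
# str.upper is only a placeholder for the real translation call
result = translate_html(sample, str.upper)
print(result)
```

In real use, translate_fn would be whatever method the Translator class exposes. Translating text node by text node can split a sentence at inline tags such as strong or em, so the result may need review for paragraphs with mixed formatting.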