Search code examples
pythonhtmlbeautifulsouphtml-parsing

Iterate over all elements in html and replace content with Beautifulsoup


In my database I am storing HTML coming from a custom CMS's WYSIWYG editor. The contents are in English and I'd like to use Beautifulsoup to iterate over every single element, translate its contents to German (using another class, Translator) and replace the value of the current element with the translated text.

So far, I have been able to come up with specific selectors for p, a, pre in combination with the .findAll function of Beautifulsoup, however I have googled and it is not clear to me how I can simply go through all elements and replace their content on the fly, instead of having to filter based on a specific type.

A very basic example of HTML produced by the editor covering all different kinds of types:

<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<p>Normal text</p>
<p><strong>Bold text</strong></p>
<p><em>Italic text </em></p>
<p><br></p>
<blockquote>Quote</blockquote>
<p>text after quote</p>
<p><br></p>
<p><br></p>
<pre class="code-syntax" spellcheck="false">code</pre>
<p><br></p>
<p>text after code</p>
<p><br></p>
<p><a href="https://google.com/" target="_blank">This is a search engine</a></p>
<p><br></p>
<p><img src="https://via.placeholder.com/350x150"></p>

The bs4 documentation points me to a replace_with function, which would be ideal if I could only select each element after each other, not having to specifically select something.

Pointers would be welcome 😊


Solution

  • You can basically do this to iterate over every element :

    html="""
    <h1>Heading 1</h1>
    <h2>Heading 2</h2>
    <h3>Heading 3</h3>
    <p>Normal text</p>
    <p><strong>Bold text</strong></p>
    <p><em>Italic text </em></p>
    <p><br></p>
    <blockquote>Quote</blockquote>
    <p>text after quote</p>
    <p><br></p>
    <p><br></p>
    <pre class="code-syntax" spellcheck="false">code</pre>
    <p><br></p>
    <p>text after code</p>
    <p><br></p>
    <p><a href="https://google.com/" target="_blank">This is a search engine</a></p>
    <p><br></p>
    <p><img src="https://via.placeholder.com/350x150"></p>
    """
    
    from bs4 import BeautifulSoup
    
    soup=BeautifulSoup(html,"lxml")
    for x in soup.findAll():
        print(x.text)
        # You can try this as well
        print(x.find(text=True,recursive=False))
        # I think this will return result as you expect.
    

    Output :

    Heading 1
    Heading 2
    Heading 3
    Normal text
    Bold text
    Italic text 
    
    Quote
    text after quote
    
    
    code
    
    text after code
    
    This is a search engine
    
    
    
    Heading 1
    Heading 2
    Heading 3
    Normal text
    Bold text
    Italic text 
    
    Quote
    text after quote
    
    
    code
    
    text after code
    
    This is a search engine
    
    
    
    Heading 1
    Heading 2
    Heading 3
    Normal text
    Bold text
    Bold text
    Italic text 
    Italic text 
    
    
    Quote
    text after quote
    
    
    
    
    code
    
    
    text after code
    
    
    This is a search engine
    This is a search engine
    
    
    
    

    And I believe you have translator function and you know how to replace that also.