Tags: python, web-scraping, beautifulsoup, tags

bs4: extra space when adding a new tag inside another


I'm trying to wrap some text inside a p tag in a strong tag. I managed to do this, but I'm getting some weird spacing:

Working on set design, illustration, graphic design, wardrobe management, prop masters, makeup artists, <strong> special effects supervisors </strong>, and more are some of the responsibilities of this position.

As you can see in this example, there is a space inside the strong tag, which makes the paragraph look a bit odd because the comma follows the space.

My code:

    text = el.text
    el.clear()
    match = re.search(r'\b%s\b' % str(keyword), text, re.IGNORECASE)
    start, end = match.start(), match.end()
    el.append(text[:start])

    strong_tag = soup.new_tag('strong')
    strong_tag.append(text[start:end])
    el.append(strong_tag)

    el.append(text[end:])

Also, when I save the HTML to a file, it is prettified. Is there a way to keep it minified?

After editing the HTML with bs4, I do:

return soup.decode('utf-8')

and then save the result to an HTML file.

The output looks like this:

<p>
some text
<strong>strong</strong>
rest of the paragraph
</p>

I would really like to keep it as:

<p>some text <strong>strong</strong> rest of the paragraph</p>

I hope to find a solution here. Thanks in advance.


Solution

  • The script itself works and adds no extra spaces. It is not clear why .decode('utf-8') is called; you can simply convert your BeautifulSoup object back to a string:

    str(soup)    
    

    Example

    from bs4 import BeautifulSoup
    import re
    
    html = '''<p>some text strong rest of the paragraph</p><p>some text strong rest of the paragraph</p><p>some text strong rest of the paragraph</p>'''
    
    keyword = 'strong'
    
    soup = BeautifulSoup(html, 'html.parser')
    
    for p in soup.select('p'):
        text = p.text
        p.clear()
        match = re.search(r'\b%s\b' % str(
            keyword), text, re.IGNORECASE)
        start, end = match.start(), match.end()
        p.append(text[:start])
    
        strong_tag = soup.new_tag('strong')
        strong_tag.append(text[start:end])
        p.append(strong_tag)
        p.append(text[end:])
    
    print(str(soup))
    

    Output

    <p>some text <strong>strong</strong> rest of the paragraph</p><p>some text <strong>strong</strong> rest of the paragraph</p><p>some text <strong>strong</strong> rest of the paragraph</p>
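
The example above assumes that every paragraph contains the keyword and that the keyword has no regex metacharacters. The following is a minimal sketch, not the answerer's code, that guards against both cases. The function name wrap_keyword and its parameters are hypothetical, introduced here only for illustration.

```python
import re

from bs4 import BeautifulSoup


def wrap_keyword(html, keyword, tag_name="strong"):
    """Wrap the first whole-word match of keyword in each <p> with tag_name.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # re.escape keeps characters such as '+' or '.' in the keyword literal.
    pattern = re.compile(r"\b%s\b" % re.escape(keyword), re.IGNORECASE)

    for p in soup.select("p"):
        text = p.get_text()
        match = pattern.search(text)
        if match is None:
            # Leave paragraphs without the keyword untouched.
            continue
        p.clear()
        p.append(text[:match.start()])
        new_tag = soup.new_tag(tag_name)
        new_tag.append(match.group(0))
        p.append(new_tag)
        p.append(text[match.end():])

    # str() serializes the tree without the indentation prettify() adds.
    return str(soup)


result = wrap_keyword(
    "<p>makeup artists, special effects supervisors, and more</p>"
    "<p>no match here</p>",
    "special effects supervisors",
)
print(result)
```

As in the original snippet, calling clear() and re-appending plain text flattens any markup nested inside the paragraph. Regarding the prettified output in the question: the first positional parameter of decode() is not the encoding (in the bs4 versions I know, it controls pretty-printing), so soup.decode('utf-8') may be what switched the indentation on. Treat this as an assumption and check it against your installed version.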