I have an HTML file produced by pandoc, where SVG illustrations have been embedded. The SVG content is encoded in base64 and included in the src attribute of img elements. It looks like this:
<figure>
<img role="img" aria-label="Figure 1" src="data:image/svg+xml;base64,<base64str>" alt="Figure 1" />
<figcaption aria-hidden="true">Figure 1</figcaption>
</figure>
I'd like to replace the img element with the decoded SVG string, using BeautifulSoup. So here's what I do:
from bs4 import BeautifulSoup
import base64

with open("file.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# get all images
images = soup.find_all("img")
# try with the first one
# decode the SVG string from the src attribute
svg_str = base64.b64decode(images[0]["src"].split(",")[1]).decode()
# replace the tag with the string
images[0].replace_with(soup.new_tag(svg_str))
However, images[0] remains unchanged, although no error is raised. I've looked at examples on the Internet, but I can't figure out what I'm doing wrong.
The issue is the way you're trying to replace the img tag with the decoded SVG string. The soup.new_tag method creates a new, empty tag and treats its first argument as the tag name, so passing the whole SVG markup gives you a tag whose name is that string, not an svg element. Instead, parse the decoded SVG string and replace the img tag with the parsed content.
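To make the failure mode concrete, here is a minimal sketch of what new_tag does with a markup string. The HTML and SVG literals are made-up stand-ins for the real file, and the variable names are only for illustration. It also suggests why images[0] looks unchanged: replace_with detaches the old element from the tree, but the Python variable still refers to that detached img.

```python
from bs4 import BeautifulSoup

# Made-up stand-ins for the real document and the decoded SVG
html = '<figure><img alt="Figure 1" src="x"/></figure>'
svg_str = '<svg xmlns="http://www.w3.org/2000/svg"></svg>'

soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

# new_tag() uses its first argument as the tag NAME; it does not parse markup
bogus = soup.new_tag(svg_str)
name_is_whole_string = bogus.name == svg_str
is_empty = len(bogus.contents) == 0

# The replacement does happen in the tree...
img.replace_with(bogus)
img_left_in_tree = soup.find("img") is not None
# ...but the old variable still holds the (now detached) img element
img_is_detached = img.parent is None
```

So inspecting the old variable shows the untouched img, while the soup itself now contains an oddly named empty tag instead of the drawing.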
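Here's how you can achieve this: decode the base64 string from the src attribute, parse the decoded SVG string into a BeautifulSoup object, and replace the img tag with the parsed SVG content. Here's the corrected code: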
from bs4 import BeautifulSoup
import base64

with open("file.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# get all images
images = soup.find_all("img")

# process each image
for img in images:
    # decode the SVG string from the src attribute
    svg_str = base64.b64decode(img["src"].split(",")[1]).decode()
    # parse the SVG string into a BeautifulSoup object
    svg_soup = BeautifulSoup(svg_str, "html.parser")
    # replace the img tag with the parsed SVG content
    img.replace_with(svg_soup)

# Save the modified HTML to a new file
with open("modified_file.html", "w") as f:
    f.write(str(soup))
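If the file may also contain img elements that are not base64 SVG (for example a linked PNG), the loop above would fail or mangle them. A slightly more defensive variant could look like the sketch below. The function name inline_svg_images and the SVG_PREFIX constant are hypothetical names introduced here, and the prefix is assumed to match the data URI shown in the question exactly.

```python
import base64
from bs4 import BeautifulSoup

# Assumed prefix, taken from the markup in the question
SVG_PREFIX = "data:image/svg+xml;base64,"

def inline_svg_images(html: str) -> str:
    """Replace base64-encoded SVG <img> elements with inline <svg> markup."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        src = img.get("src", "")
        if not src.startswith(SVG_PREFIX):
            continue  # leave other images (PNG, external URLs, ...) alone
        svg_str = base64.b64decode(src[len(SVG_PREFIX):]).decode("utf-8")
        # keep only the <svg> element, dropping any XML declaration around it
        svg_tag = BeautifulSoup(svg_str, "html.parser").find("svg")
        if svg_tag is not None:
            img.replace_with(svg_tag)
    return str(soup)

# Small self-check with a made-up SVG and a made-up second image
svg = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="5"></circle></svg>'
encoded = base64.b64encode(svg.encode("utf-8")).decode("ascii")
html = (
    f'<figure><img alt="Figure 1" src="{SVG_PREFIX}{encoded}"/>'
    '<img src="photo.png"/></figure>'
)
result = inline_svg_images(html)
```

Note that the role, aria-label and alt attributes of the original img are dropped by this replacement; if they matter for accessibility, they would need to be copied onto the svg element separately.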