I am using BeautifulSoup4 to parse an HTML string into a structured object, but I want every HTML element (e.g. soup.body.title) to have an attribute called embed (e.g. soup.body.title.embed).
So I created child classes of Tag and BeautifulSoup with the embed attribute added. The type of the root node object is EmbedSoup, as I intended, but the type of soup.body is bs4.element.Tag instead of EmbedTag.
How do I make sure that all elements of the BeautifulSoup tree are of type EmbedTag and not bs4.element.Tag? Or is there another way to solve this problem?
from bs4 import BeautifulSoup
from bs4.element import Tag

class EmbedTag(Tag):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.embed = None  # Initialize the embed attribute to None

class EmbedSoup(BeautifulSoup):
    def __init__(self, *args, **kwargs):
        kwargs['element_classes'] = {'tag': EmbedTag}
        super().__init__(*args, **kwargs)

# Parse the HTML with the custom BeautifulSoup class
soup = EmbedSoup(html_content, 'html.parser')
type(soup)       # --> EmbedSoup
type(soup.body)  # --> bs4.element.Tag
According to the Beautiful Soup documentation, it looks like you have to supply a dictionary that maps from type to type, rather than from str to type, when you want to use custom subclasses of Beautiful Soup classes:
from bs4 import BeautifulSoup, Tag

class EmbedSoup(BeautifulSoup):
    def __init__(self, *args, **kwargs):
        # Key on the Tag class itself, not the string 'tag'
        kwargs['element_classes'] = {Tag: EmbedTag}
        super().__init__(*args, **kwargs)
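
To check that the change takes effect, here is a minimal end-to-end sketch that combines the EmbedTag class from the question with the corrected EmbedSoup. The sample html_content string is made up for illustration, and the check assumes a Beautiful Soup version recent enough to accept the element_classes argument.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


class EmbedTag(Tag):
    """Tag subclass that carries an extra per-element `embed` attribute."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.embed = None  # placeholder value, to be filled in later


class EmbedSoup(BeautifulSoup):
    """BeautifulSoup subclass that asks the parser to build EmbedTag objects."""

    def __init__(self, *args, **kwargs):
        # Keys must be the bs4 classes themselves, not strings like 'tag'.
        kwargs["element_classes"] = {Tag: EmbedTag}
        super().__init__(*args, **kwargs)


# Hypothetical sample input, not taken from the question.
html_content = "<html><body><p>Hello <b>world</b></p></body></html>"
soup = EmbedSoup(html_content, "html.parser")

# Collect the concrete class of every element below the root.
tag_types = {type(tag) for tag in soup.find_all(True)}

print(type(soup).__name__)
print(type(soup.body).__name__)
print(tag_types == {EmbedTag})
```

Because embed is assigned in __init__, it is found as an ordinary instance attribute. That matters here: on a plain Tag, an unknown attribute name such as tag.embed is treated as a search for a child element with that name, so it would look for an <embed> tag rather than raise an error.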