I am trying to convert HTML to Markdown using markdownify. This library doesn't remove a comment that sits inside a style tag, and I am trying to understand why.
One of the methods of the MarkdownConverter class is process_tag, and I think the key is somewhere in there. See below (I added some prints to check):
def process_tag(self, node, convert_as_inline, children_only=False):
    text = ''
    # markdown headings or cells can't include
    # block elements (elements w/newlines)
    isHeading = html_heading_re.match(node.name) is not None
    isCell = node.name in ['td', 'th']
    convert_children_as_inline = convert_as_inline
    if not children_only and (isHeading or isCell):
        convert_children_as_inline = True
    print(f"convert_children_as_inline = {convert_children_as_inline}")
    # Remove whitespace-only textnodes in purely nested nodes
    def is_nested_node(el):
        return el and el.name in ['ol', 'ul', 'li',
                                  'table', 'thead', 'tbody', 'tfoot',
                                  'tr', 'td', 'th']
    if is_nested_node(node):
        for el in node.children:
            # Only extract (remove) whitespace-only text node if any of the
            # conditions is true:
            # - el is the first element in its parent
            # - el is the last element in its parent
            # - el is adjacent to an nested node
            can_extract = (not el.previous_sibling
                           or not el.next_sibling
                           or is_nested_node(el.previous_sibling)
                           or is_nested_node(el.next_sibling))
            if (isinstance(el, NavigableString)
                    and six.text_type(el).strip() == ''
                    and can_extract):
                el.extract()
    # Convert the children first
    for i, el in enumerate(node.children):
        cl = None
        if isinstance(el, Comment):
            cl = "Comment"
        elif isinstance(el, Doctype):
            cl = "Doctype"
        elif isinstance(el, NavigableString):
            cl = "NavigableString"
        else:
            cl = "Other"
        print(f"{i}, cl = {cl}, el = {el}")
        if isinstance(el, Comment) or isinstance(el, Doctype):
            continue
        elif isinstance(el, NavigableString):
            text += self.process_text(el)
        else:
            text += self.process_tag(el, convert_children_as_inline)
    if not children_only:
        convert_fn = getattr(self, 'convert_%s' % node.name, None)
        if convert_fn and self.should_convert_tag(node.name):
            text = convert_fn(node, text, convert_as_inline)
    return text
My test file consists of two parts:
<style><!-- 1. some style defenitions --></style>
<!-- 2. some style definitions -->
What I see in the terminal:
convert_children_as_inline = False
0, cl = NavigableString, el =
1, cl = Other, el = <style><!-- 1. some style defenitions --></style>
convert_children_as_inline = False
0, cl = NavigableString, el = <!-- 1. some style defenitions -->
2, cl = NavigableString, el =
3, cl = Comment, el = 2. some style definitions
And what I see in the output file:
<!-- 1. some style defenitions -->
Please explain why the converter didn't treat the string <!-- 1. some style defenitions --> as a comment. I'm a bit confused, because it does treat the second part as a comment (I want to get an empty output file).
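One possible explanation, which I have not confirmed against the markdownify source: HTML treats the content of a style element as raw text, so a parser never looks for comments inside it. Assuming markdownify builds its tree with BeautifulSoup's "html.parser" backend, which sits on top of Python's own html.parser module, the behaviour can be reproduced with the standard library alone. In the sketch below, EventLogger and log_events are hypothetical helper names of mine, not part of any library; they only record which callback receives each piece of the test file.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records which callback receives each chunk."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        # Called only for comments found in normal (non raw-text) content
        self.events.append(("comment", data))

    def handle_data(self, data):
        # Called for plain text, including the raw content of <style>
        self.events.append(("data", data))


def log_events(html):
    """Feed the markup to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


# Same two parts as in my test file
html = (
    "<style><!-- 1. some style defenitions --></style>\n"
    "<!-- 2. some style definitions -->"
)
events = log_events(html)
for kind, payload in events:
    print(kind, repr(payload))
```

If this assumption holds, the first part arrives as plain data and only the second part arrives as a comment, which would match the NavigableString / Comment split in my prints: process_tag skips Comment nodes, but the text inside the style tag is an ordinary string to it and goes through process_text.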
I don't know the reason for this behaviour, but I tried to deal with the consequences.
I created a derived class and added a convert_style method to it that returns an empty string:
def convert_style(self, el, text, convert_as_inline):
    return ""
If you have any suggestions about the question or the solution, I'd be glad to hear them.