markdownify doesn't remove a comment inside a style tag


I am trying to convert HTML to Markdown using markdownify. The library doesn't remove a comment that sits inside a style tag, and I am trying to understand why.

One of the methods of the MarkdownConverter class is process_tag, and I think the key is somewhere in it. See below (I added some prints to check):

def process_tag(self, node, convert_as_inline, children_only=False):
    text = ''

    # markdown headings or cells can't include
    # block elements (elements w/newlines)
    isHeading = html_heading_re.match(node.name) is not None
    isCell = node.name in ['td', 'th']
    convert_children_as_inline = convert_as_inline

    if not children_only and (isHeading or isCell):
        convert_children_as_inline = True
    
    print(f"convert_children_as_inline = {convert_children_as_inline}")

    # Remove whitespace-only textnodes in purely nested nodes
    def is_nested_node(el):
        return el and el.name in ['ol', 'ul', 'li',
                                  'table', 'thead', 'tbody', 'tfoot',
                                  'tr', 'td', 'th']

    if is_nested_node(node):
        for el in node.children:
            # Only extract (remove) whitespace-only text node if any of the
            # conditions is true:
            # - el is the first element in its parent
            # - el is the last element in its parent
            # - el is adjacent to an nested node
            can_extract = (not el.previous_sibling
                           or not el.next_sibling
                           or is_nested_node(el.previous_sibling)
                           or is_nested_node(el.next_sibling))
            if (isinstance(el, NavigableString)
                    and six.text_type(el).strip() == ''
                    and can_extract):
                el.extract()

    # Convert the children first
    for i, el in enumerate(node.children):
        cl = None
        if isinstance(el, Comment):
            cl = "Comment"
        elif isinstance(el, Doctype):
            cl = "Doctype"
        elif isinstance(el, NavigableString):
            cl = "NavigableString"
        else:
            cl = "Other" 
        print(f"{i}, cl = {cl}, el = {el}")
        if isinstance(el, Comment) or isinstance(el, Doctype):
            continue
        elif isinstance(el, NavigableString):
            text += self.process_text(el)
        else:
            text += self.process_tag(el, convert_children_as_inline)

    if not children_only:
        convert_fn = getattr(self, 'convert_%s' % node.name, None)
        if convert_fn and self.should_convert_tag(node.name):
            text = convert_fn(node, text, convert_as_inline)

    return text

My test file consists of two parts:

<style><!-- 1. some style defenitions --></style>

<!-- 2. some style definitions -->

What I see in the terminal:

convert_children_as_inline = False
0, cl = NavigableString, el =  
1, cl = Other, el = <style><!-- 1. some style defenitions --></style>
convert_children_as_inline = False
0, cl = NavigableString, el = <!-- 1. some style defenitions -->
2, cl = NavigableString, el = 

3, cl = Comment, el =  2. some style definitions

And what I see in the output file:

<!-- 1. some style defenitions -->

Please explain why the converter didn't recognize the string <!-- 1. some style defenitions --> as a comment. I'm a bit confused, because it does recognize the second part as a comment (I want to get an empty output file).


Solution

  • I don't know why this happens, but I tried to deal with the consequences.

    I created a derived class and added a convert_style method to it that returns an empty string:

    def convert_style(self, el, text, convert_as_inline):
        return ""
    

    If you have any suggestions about the question or the solution, I'd be glad to see them.
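
A likely explanation for the behaviour in the question: HTML treats the contents of a style element as raw text, so a parser hands `<!-- ... -->` inside `<style>` back as plain character data rather than as a comment node. That matches the debug output above, where the first string is reported as NavigableString and only the second as Comment. The sketch below shows the same split using only the standard library's `html.parser.HTMLParser`, which Beautiful Soup's "html.parser" backend is built on. The `EventLogger` class and its `events` list are hypothetical names made up for this illustration; they are not part of markdownify or Beautiful Soup.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record which callback receives each piece of input."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_data(self, data):
        # Skip whitespace-only text such as the newline between the two parts.
        if data.strip():
            self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


# Same two-part test input as in the question.
html = (
    "<style><!-- 1. some style defenitions --></style>\n"
    "<!-- 2. some style definitions -->"
)

logger = EventLogger()
logger.feed(html)
logger.close()

for kind, payload in logger.events:
    print(kind, repr(payload))
```

With this input, the text inside `<style>` should arrive through `handle_data` and only the second part through `handle_comment`. If that is what happens in your setup too, the `isinstance(el, Comment)` check in process_tag never sees the first string, so overriding convert_style (as in the answer above) or removing the style elements before conversion is a reasonable way to get an empty result.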