Tags: python, internet-explorer, beautifulsoup, conditional-comments, html5lib

BeautifulSoup4 extract all types of conditional comments


What I am trying to do:

Remove suspicious comments from HTML mails with bs4. I have run into a problem with so-called conditional comments of the downlevel-revealed type.

See: https://learn.microsoft.com/en-us/previous-versions/windows/internet-explorer/ie-developer/compatibility/ms537512(v=vs.85)#syntax-of-conditional-comments

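import bs4

# The first block is downlevel-hidden, the second is downlevel-revealed.
html = 'A<!--[if expression]>a<![endif]-->' \
       'B<![if expression]>b<![endif]>'

soup = bs4.BeautifulSoup(html, 'html5lib')

# Remove every Comment node from the tree.
# string= replaces the deprecated text= argument (bs4 4.4.0 and later).
for comment in soup.find_all(string=lambda text: isinstance(text, bs4.Comment)):
    comment.extract()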

Before extracting comments:

'A',
'[if expression]>a<![endif]',
'B',
'[if expression]',
'b',
'[endif]',

After extracting comments:

'A',
'B',
'b',

Problem:

The lowercase b should also be removed. The problem is that bs4 parses the first block as one single Comment object, but the second block as three objects: Comment([if expression]), NavigableString(b) and Comment([endif]). Extraction removes only the two Comment objects, so the NavigableString with content 'b' remains in the DOM.
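
The node types described above can be checked with a short sketch like the following. It assumes bs4 and html5lib are installed; the variable name nodes is only illustrative.

```python
import bs4

html = ('A<!--[if expression]>a<![endif]-->'
        'B<![if expression]>b<![endif]>')
soup = bs4.BeautifulSoup(html, 'html5lib')

# Pair each child of <body> with the name of its bs4 class.
nodes = [(type(node).__name__, str(node)) for node in soup.body.children]

for kind, text in nodes:
    print(kind, repr(text))
```

Comment is a subclass of NavigableString, which is why the isinstance check against bs4.Comment matches only the comment nodes and skips the plain text.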

Any solution to this?


Solution

  • After spending some time reading about conditional comments, I now understand why this happens.

    downlevel-hidden

    Downlevel-hidden conditional comments are written as a normal comment, <!-- ... -->. Modern browsers treat the whole block as one ordinary comment, so BeautifulSoup removes it completely when I remove comments.

    downlevel-revealed

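    Downlevel-revealed conditional comments are written as <![if ...]>b<![endif]>. Modern browsers treat the two markers as invalid markup (an HTML5 parser turns each one into a separate bogus comment) and do not render them, so only b remains as visible content. BeautifulSoup therefore removes only the two markers, not the content between them.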

    Conclusion

    BeautifulSoup (with html5lib) handles conditional comments the way modern browsers do. This is perfectly fine for my use case.
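
If the content between downlevel-revealed markers should be removed as well, one possible approach is to treat a comment that starts with [if and has no [endif] inside it as an opening marker, then drop all following siblings up to the matching [endif] comment. The sketch below is only an illustration under these assumptions: the helper names is_marker and strip_comments are made up for this example, both markers must share the same parent element, nested blocks are not handled, and a block without a closing marker keeps its content.

```python
import bs4


def is_marker(node, prefix):
    """True if node is a comment whose text starts with prefix, e.g. '[if'."""
    return isinstance(node, bs4.Comment) and node.strip().lower().startswith(prefix)


def strip_comments(soup):
    """Remove all comments, plus content between downlevel-revealed markers."""
    comments = [s for s in soup.find_all(string=True) if isinstance(s, bs4.Comment)]
    for comment in comments:
        if comment.parent is None:
            continue  # already removed as part of an earlier block
        # A downlevel-revealed opener looks like '[if expression]' and,
        # unlike a downlevel-hidden comment, does not contain '[endif]'.
        if is_marker(comment, '[if') and '[endif]' not in comment.lower():
            block = []
            for sibling in comment.next_siblings:
                block.append(sibling)
                if is_marker(sibling, '[endif]'):
                    # Remove the block only when a closing marker was found.
                    for node in block:
                        node.extract()
                    break
        comment.extract()


html = ('A<!--[if expression]>a<![endif]-->'
        'B<![if expression]>b<![endif]>')
soup = bs4.BeautifulSoup(html, 'html5lib')
strip_comments(soup)
print(soup.body.get_text())
```

Collecting the siblings first and removing them only once [endif] is seen avoids deleting the rest of a parent element when a mail contains an unterminated opening marker.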