
BeautifulSoup4 extract all types of conditional comments

What I am trying to do:

Remove suspicious comments from HTML mails with bs4. I have run into a problem with so-called conditional comments of the downlevel-revealed type.


import bs4

html = 'A<!--[if expression]>a<![endif]-->' \
       'B<![if expression]>b<![endif]>'

soup = bs4.BeautifulSoup(html, 'html5lib')

# Find every Comment node and detach it from the tree.
for comment in soup.find_all(string=lambda text: isinstance(text, bs4.Comment)):
    comment.extract()

Before extracting comments:

'[if expression]>a<![endif]',
'[if expression]',
'[endif]',

After extracting comments:



The lowercase b should also be removed. The problem is that bs4 parses the first conditional comment as one single Comment object, but the second one as three objects: Comment([if expression]), NavigableString(b) and Comment([endif]). Extraction removes only the two Comment objects; the NavigableString with content 'b' remains in the DOM.
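
To make the split visible, the children of the body element can be listed together with their node classes. This is only a minimal sketch; it assumes html5lib is installed, and the variable names (nodes, kind, value) are illustrative:

```python
import bs4

html = ('A<!--[if expression]>a<![endif]-->'
        'B<![if expression]>b<![endif]>')
soup = bs4.BeautifulSoup(html, 'html5lib')

# Pair each direct child of <body> with the name of its node class.
nodes = [(type(node).__name__, str(node)) for node in soup.body.children]

for kind, value in nodes:
    print(kind, repr(value))
```

With the html5lib parser the downlevel-hidden block should show up as one Comment, while the downlevel-revealed block shows up as a Comment, a NavigableString holding 'b', and a second Comment.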

Any solution to this?


  • After reading up on conditional comments for a while, I understand why this happens.


    Downlevel-hidden conditional comments are written as a normal comment, <!-- ... -->. Modern browsers parse the whole block as one ordinary comment, and so does BeautifulSoup, so the block is removed completely when I remove comments.


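    Downlevel-revealed conditional comments are written as <!...>b<!...>. Modern browsers treat the two <!...> markers as invalid markup and do not render them, so only b remains as visible content. The parser turns each marker into its own comment node, so BeautifulSoup removes only the markers, not the content between them.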


    BeautifulSoup therefore handles conditional comments the same way modern browsers do, which is perfectly fine for my circumstances.
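
If the revealed content should be removed as well, one possible approach is to treat a comment of the form [if ...] as an opening marker and drop every following sibling up to the matching [endif] comment. The sketch below is not part of bs4; the function name strip_conditional_comments and the two regular expressions are hypothetical helpers, and it assumes both markers end up as siblings under the same parent and are not nested:

```python
import re
import bs4

# Assumption: an opening marker is exactly "[if ...]", a closing one "[endif]".
OPEN_RE = re.compile(r'^\s*\[if\b[^\]]*\]\s*$', re.IGNORECASE)
CLOSE_RE = re.compile(r'^\s*\[endif\]\s*$', re.IGNORECASE)


def strip_conditional_comments(soup):
    """Remove all comments, plus siblings enclosed by [if ...] / [endif]."""
    comments = soup.find_all(string=lambda s: isinstance(s, bs4.Comment))
    for comment in comments:
        if comment.parent is None:
            continue  # already detached as part of an earlier block
        if OPEN_RE.match(comment):
            # Collect the siblings up to and including the closing marker.
            enclosed = []
            closed = False
            for sibling in comment.next_siblings:
                enclosed.append(sibling)
                if isinstance(sibling, bs4.Comment) and CLOSE_RE.match(sibling):
                    closed = True
                    break
            # Only drop the enclosed nodes if a closing marker was found.
            if closed:
                for node in enclosed:
                    node.extract()
        comment.extract()
    return soup


html = ('A<!--[if expression]>a<![endif]-->'
        'B<![if expression]>b<![endif]>')
soup = bs4.BeautifulSoup(html, 'html5lib')
strip_conditional_comments(soup)
print(soup.body.get_text())
```

For the example from the question this should leave only A and B in the body. If the parser moves the two markers into different parents, no closing marker is found and the function falls back to removing just the comments.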