I would like to use bleach to clean up some potentially malformed HTML. In the following sample, ideally bleach should remove:
- the stray space in <p >
- the attribute in the closing tag </a attr="test">
- the stray space in </p >
My code looks like this:
import bleach
html = """<p >This <a href="book"> book </a attr="test"> will help you</p >"""
html_cleaned = bleach.clean(html)
# html_cleaned is:
# '&lt;p &gt;This <a href="book"> book </a> will help you&lt;/p&gt;'
As you can see, bleach is very inconsistent:
- The angle brackets of the p tag are escaped to &lt; and &gt;. For the link tag, this doesn't happen.
- The spaces in the closing </p > are removed; in the opening <p > they are not.
- If I add an attribute to the closing p tag, as in </p attr="test">, it is not removed, while for the closing </a attr="test"> the illegal attribute is removed.

What is happening here?
bleach.clean expects an optional tags parameter, which specifies the allowed tags.
The p tag is not allowed by default, so it is escaped as text and doesn't get the sanitizing treatment.
My problem can be fixed by:
cleaned_doc = bleach.clean(input_doc, tags=bleach.sanitizer.ALLOWED_TAGS + ["p"])
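The fix above can be sketched in a form that does not depend on whether ALLOWED_TAGS is a list or a set. This is a minimal sketch, not the library's recommended pattern: the helper names extend_allowed_tags and clean_with_extra_tags are made up for illustration, and it assumes that newer bleach releases (6.0 and later, as far as I know) ship ALLOWED_TAGS as a frozenset, where the + ["p"] form would raise a TypeError.

```python
def extend_allowed_tags(default_tags, extra_tags):
    """Return a new set containing the default tags plus the extra ones.

    Hypothetical helper: converting to a set first means it works whether
    the defaults are a list (older bleach) or a frozenset (assumed for
    newer bleach).
    """
    return set(default_tags) | set(extra_tags)


def clean_with_extra_tags(html, extra_tags=("p",)):
    """Sanitize html, allowing extra_tags on top of bleach's defaults."""
    # Imported here so the helper above can be used without bleach installed.
    import bleach

    tags = extend_allowed_tags(bleach.sanitizer.ALLOWED_TAGS, extra_tags)
    return bleach.clean(html, tags=tags)
```

Calling clean_with_extra_tags(html) with the sample string from the question should then treat the p tag like the a tag, that is, parse and normalize it instead of escaping it.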