I have a list of adblock rules (example).
How can I apply them to a web page? I download the page source with MechanicalSoup (which is based on BeautifulSoup). I would like to keep the result as a BeautifulSoup object, but an lxml etree is fine too.
I tried to use the following code, but it has problems with some pages:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
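For context, lxml appears to raise this error when it is handed an already-decoded `str` that still begins with an XML declaration naming an encoding. Below is a minimal sketch of that situation and of the bytes-based workaround the message itself suggests; the `page` string is a made-up example, not one of the failing pages.

```python
import lxml.html

# Hypothetical page source: a decoded str that still carries an encoding declaration.
page = '<?xml version="1.0" encoding="utf-8"?>\n<html><body><p>hi</p></body></html>'

try:
    # Passing the str directly is the case the error message complains about.
    lxml.html.document_fromstring(page)
except ValueError as exc:
    print("str input rejected:", exc)

# Passing bytes lets the parser handle the declared encoding itself.
doc = lxml.html.document_fromstring(page.encode("utf-8"))
print(doc.findtext(".//p"))
```

With requests, the equivalent is to pass `response.content` (bytes) rather than `response.text` (str) to the parser.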
This is virtually the same code as in Nikita's answer, but I wanted to share it with all the imports and without the mechanicalsoup dependency, for people who'd like to try it out.
from lxml.etree import tostring
import lxml.html
import requests
# take AdRemover code from here:
# https://github.com/buriy/python-readability/issues/43#issuecomment-321174825
from adremover import AdRemover
url = 'https://google.com' # replace it with a url you want to apply the rules to
rule_urls = ['https://easylist-downloads.adblockplus.org/ruadlist+easylist.txt',
             'https://filters.adtidy.org/extension/chromium/filters/1.txt']
rule_files = [url.rpartition('/')[-1] for url in rule_urls]
# download files containing rules
for rule_url, rule_file in zip(rule_urls, rule_files):
    r = requests.get(rule_url)
    with open(rule_file, 'w') as f:
        print(r.text, file=f)
remover = AdRemover(*rule_files)
# use the raw bytes: lxml rejects str input that carries an encoding declaration
html = requests.get(url).content
document = lxml.html.document_fromstring(html)
remover.remove_ads(document)
clean_html = tostring(document).decode("utf-8")
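To give a rough idea of what gets extracted from the rule files: in Adblock Plus filter syntax, generic element-hiding rules are lines that start with `##` followed by a CSS selector. The sketch below pulls those selectors out of a list of lines. It is an illustration only and assumes AdRemover works along these lines; the function name `generic_hiding_selectors` and the sample rules are hypothetical.

```python
def generic_hiding_selectors(lines):
    """Return the CSS selectors of generic element-hiding rules ('##selector')."""
    selectors = []
    for line in lines:
        line = line.strip()
        # Only lines starting with '##' are kept; comments, network filters and
        # domain-specific rules such as 'example.com##.ad' are skipped.
        if line.startswith("##"):
            selectors.append(line[2:])
    return selectors


# Made-up sample covering the common kinds of lines in a filter list.
sample_rules = [
    "[Adblock Plus 2.0]",
    "! a comment line",
    "||ads.example.com^",
    "##.ad-banner",
    "example.com##.sponsored",
    "###sidebar-ads",
]
selectors = generic_hiding_selectors(sample_rules)
print(selectors)
```

If the result has to stay in BeautifulSoup form, as the question asks, each selector could probably be passed to `soup.select(...)` and the matches removed with `decompose()`; selectors that the CSS engine cannot parse would need to be skipped.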