html python-3.x web-scraping data-cleaning newspaper3k

How do I remove unwanted classes and tags from a newspaper3k object?

I want to extract the content of news articles, and I'm currently using the newspaper3k library:

from newspaper import Article

a = Article(url, memoize_articles=False, language='en')  # url: address of the article
a.download()
a.parse()
content = a.text  # extracted body text

For some websites, however, the extracted text includes unwanted elements such as advertisements and text from images. I want to remove those elements and their text. Is there a way to remove all the content that comes from those tags and classes?

Solution

If you want to do it for a particular website, you can use a.top_node, find an XPath or CSS selector that matches the advertisements, and then remove the matching elements.

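ads = a.top_node.xpath("./foo")  # replace "./foo" with a selector that matches the ads
for ad in ads:
    # note: lxml's remove() also discards the text that follows the element (its tail)
    ad.getparent().remove(ad)

# and now convert top_node to text again somehow, probably using
# OutputFormatter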

See https://github.com/codelucas/newspaper/blob/56de65af9efbfea6293c82c0b1821e2ca9fbddaa/newspaper/article.py#L281
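The removal step can be sketched end to end with lxml alone, which may help when testing a selector before wiring it into newspaper3k. This is only a minimal sketch under assumptions: the sample HTML, the class names ad-slot and img-caption, and the helpers strip_unwanted and node_to_text are all hypothetical, and node_to_text is a simple stand-in rather than newspaper3k's OutputFormatter. It also assumes the node is an lxml.html element (which a.top_node appears to be), so that drop_tree() is available.

```python
import lxml.html

# Hypothetical page fragment standing in for the subtree behind a.top_node.
SAMPLE_HTML = """
<div id="article">
  <p>First paragraph of the story.</p>
  <div class="ad-slot">Buy our product!</div>
  <p>Second paragraph.<span class="img-caption">Photo: Agency</span> It continues here.</p>
</div>
"""

# Assumed, site-specific selector: matches elements whose class list contains
# "ad-slot" or "img-caption". Inspect the real page to find the right one.
UNWANTED_XPATH = (
    ".//*[contains(concat(' ', normalize-space(@class), ' '), ' ad-slot ')"
    " or contains(concat(' ', normalize-space(@class), ' '), ' img-caption ')]"
)


def strip_unwanted(node, xpath=UNWANTED_XPATH):
    """Remove every element matching `xpath` from `node`, in place."""
    for unwanted in node.xpath(xpath):
        # drop_tree() removes the element and its children but keeps the
        # tail text, which plain getparent().remove() would discard.
        unwanted.drop_tree()
    return node


def node_to_text(node):
    """Join the text of each remaining <p> with blank lines (a simple stand-in)."""
    paragraphs = (" ".join(p.text_content().split()) for p in node.iter("p"))
    return "\n\n".join(p for p in paragraphs if p)


top_node = lxml.html.fromstring(SAMPLE_HTML)
cleaned_text = node_to_text(strip_unwanted(top_node))
print(cleaned_text)
```

With a real article, the same idea would presumably be applied as strip_unwanted(a.top_node) after a.parse(). Note that a.text is not recomputed automatically, so the text has to be regenerated from the modified node, either with a helper like the one above or with the library's own formatter.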

It could also be possible to implement a custom DocumentCleaner and put this logic there.

In general, this is a hard problem, probably the hardest one in article extraction, if you want to solve it in a generic and robust way without writing and maintaining rules for each website. Open-source libraries can often find the main content with reasonable quality, but they are pretty bad at excluding extra material from the article body. See https://github.com/scrapinghub/article-extraction-benchmark and the accompanying report at https://github.com/scrapinghub/article-extraction-benchmark/releases/download/v1.0.0/paper-v1.0.0.pdf.

Commercial tools like AutoExtract by Scrapinghub (I work there) solve this issue; they use computer vision and machine learning, as it is hard to solve this problem reliably otherwise.