Search code examples
pythonpython-newspaper

Newspaper3k returns 0 articles from archive.org waybackmachine pages whereas the live page works as expected


When attempting to use the python library newspaper3 on an archived page url from archive.org it fails to fetch any articles. However when using it on the same live page url it works fine. Please see below:

import newspaper

len(newspaper.build('https://bbc.co.uk/news').articles)
>> 111

len(newspaper.build('https://web.archive.org/web/http://www.bbc.co.uk/news').articles)
>>> 0

Even using the special id hack that returns the original modified page doesn't work:

len(newspaper.build('https://web.archive.org/web/20171219030622id_/http://www.bbc.co.uk/news').articles)
    >>> 0

Any help would be much appreciated, thanks!


Solution

  • I find no indication that this library is meant to work with archive.org, or that it works with archive.org.

    Both [1][2] lists of sources contain no mention of either archive.org or web.archive.org.

    I downloaded the entire repository to search the source code, and it also contains no mention of either Internet Archive domain.

    From what I can tell based on this file, the articles attribute is based on RSS/ATOM feeds. I don't think Internet Archive archives those, and even if it does, since they would link back to the live version of the site, some changes in the library itself would be needed to get them to work with Internet Archive.

    You've already opened an issue, where you specify it doesn't work at all (even on single articles -- this is probably an issue elsewhere, such as in the node scoring algorithm it uses to decide which nodes contain the article) so if you don't want to dive into the library source code and fix it yourself, all you can do is wait.