
Newspaper library


As an absolute newbie to Python, I have run into a few difficulties using the newspaper library. My goal is to use newspaper on a regular basis to download all new articles from a German news website called "tagesschau" and all articles from CNN, to build a data set I can analyze in a few years. If I understood it correctly, I can use the following commands to download and scrape all articles into Python.

import newspaper
from newspaper import news_pool

tagesschau_paper = newspaper.build('http://tagesschau.de')
cnn_paper = newspaper.build('http://cnn.com')

papers = [tagesschau_paper, cnn_paper]
news_pool.set(papers, threads_per_source=2) # (2*2) = 4 threads total
news_pool.join()

If that's the right way to download all articles, how can I extract and save them outside of Python? Or how can I save those articles in Python so that I can reuse them after I restart Python?

Thanks for your help.


Solution

  • The following code saves the downloaded articles in HTML format. In the working folder, you'll find tagesschau_paper0.html, tagesschau_paper1.html, tagesschau_paper2.html, and so on.

    import newspaper
    from newspaper import news_pool
    
    tagesschau_paper = newspaper.build('http://tagesschau.de')
    cnn_paper = newspaper.build('http://cnn.com')
    
    papers = [tagesschau_paper, cnn_paper]
    news_pool.set(papers, threads_per_source=2)
    news_pool.join()
    
    for i in range(tagesschau_paper.size()):
        # Write each article's raw HTML to its own numbered file
        with open("tagesschau_paper{}.html".format(i), "w", encoding="utf-8") as file:
            file.write(tagesschau_paper.articles[i].html)
    

    Note: news_pool doesn't get anything from CNN, so I skipped writing code for it. If you check cnn_paper.size(), it returns 0. You have to import and use Source instead.

    The code above can serve as an example for saving articles in other formats too, e.g. txt, or for saving only the parts you need from the articles, e.g. authors, body, publish_date.
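
To illustrate saving only selected parts, here is a minimal sketch that appends one JSON object per article to a text file, so the data survives a restart of Python. The helper names (article_to_record, save_records, load_records) and the file name articles.jsonl are my own assumptions, not part of newspaper. The helpers only rely on the article attributes url, title, authors, publish_date and text.

```python
import json
from pathlib import Path


def article_to_record(article):
    """Collect the fields of interest from an article-like object."""
    publish_date = getattr(article, "publish_date", None)
    return {
        "url": getattr(article, "url", None),
        "title": getattr(article, "title", None),
        "authors": list(getattr(article, "authors", None) or []),
        # publish_date may be a datetime or None; store it as ISO text
        "publish_date": publish_date.isoformat() if publish_date else None,
        "text": getattr(article, "text", None),
    }


def save_records(articles, out_path):
    """Append one JSON object per line (JSON Lines); return how many were written."""
    count = 0
    with Path(out_path).open("a", encoding="utf-8") as fh:
        for article in articles:
            record = article_to_record(article)
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def load_records(path):
    """Read the saved records back as a list of dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

With newspaper, you would probably call article.parse() on each downloaded article first (as far as I know, title, authors, text and publish_date are only filled in after parsing), then save_records(tagesschau_paper.articles, "articles.jsonl"). Because the file is opened in append mode, running this regularly keeps adding to the same file; it does not check for duplicates.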