python web-scraping beautifulsoup newspaper3k

Scraping the news titles from news websites


I've been trying to scrape news titles from news websites. For that, I've come across two Python libraries: newspaper and beautifulsoup4. Using Beautiful Soup, I've been able to get all the links on a particular news website that lead to news articles. With the code below, I've been able to extract the title of a news article from a single link.

from newspaper import Article

url = "https://www.ndtv.com/india-news/tamil-nadu-government-reverses-decision-to-reopen-schools-from-november-16-for-classes-9-12-news-agency-pti-2324199"
article = Article(url)
article.download()
article.parse()
print(article.title)

I want to combine the code from both libraries, newspaper and beautifulsoup4, so that every link Beautiful Soup returns is passed as the URL to newspaper's Article, and I get the titles of all the linked articles. Below is the Beautiful Soup code with which I've been able to extract all the links to the news articles.

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("https://www.ndtv.com/coronavirus?pfrom=home-mainnavgation")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])

Solution

  • Do you mean something like this?

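    from urllib.parse import urljoin
    from newspaper import Article
    from newspaper.article import ArticleException

    # Resolve relative hrefs against the page URL and drop duplicates, keeping order.
    hrefs = (a['href'] for a in soup.find_all('a', href=True))
    links = list(dict.fromkeys(urljoin(resp.url, href) for href in hrefs))

    for link in links:
        if not link.startswith(('http://', 'https://')):
            continue  # skip javascript:, mailto:, and similar links
        article = Article(link)
        try:
            article.download()
            article.parse()
        except ArticleException:
            continue  # skip pages that cannot be downloaded or parsed
        print(article.title)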
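
    Most anchors on a listing page are navigation, section, or social links rather than articles, so it may help to narrow the list before handing it to newspaper. Below is a minimal sketch of one way to do that. The helper names `article_links` and `print_titles` are made up for illustration, and the rule that an article URL ends in "-<numeric id>" is only an assumption based on the NDTV URL in the question; other sites will need a different test.

```python
from urllib.parse import urljoin, urlparse


def article_links(hrefs, base_url):
    """Return de-duplicated absolute URLs on base_url's host that look like articles."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base_url, href.strip())  # resolve relative links
        url = url.split('#', 1)[0]             # drop fragments such as #comments
        parts = urlparse(url)
        if parts.scheme not in ('http', 'https'):
            continue  # skips javascript:, mailto:, etc.
        if parts.netloc != base_host:
            continue  # skip links to other sites
        # Assumption: article slugs end in "-<numeric id>", as in the NDTV
        # URL from the question. Adjust this check for other sites.
        last_segment = parts.path.rstrip('/').rsplit('/', 1)[-1]
        if not last_segment.rsplit('-', 1)[-1].isdigit():
            continue
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


def print_titles(soup, base_url):
    """Print the title of each article linked from an already-parsed page."""
    # Imported here so article_links() stays usable without newspaper installed.
    from newspaper import Article
    from newspaper.article import ArticleException

    hrefs = [a['href'] for a in soup.find_all('a', href=True)]
    for url in article_links(hrefs, base_url):
        article = Article(url)
        try:
            article.download()
            article.parse()
        except ArticleException:
            continue  # skip pages that fail to download or parse
        print(article.title)
```

    With the `soup` and `resp` objects from the question, this would be called as `print_titles(soup, resp.url)`.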