python, web-scraping, rss, google-news

How to scrape Google News headlines from a particular year (e.g. news from 2020)


I've been exploring web scraping with Python and RSS feeds, but I'm not sure how to restrict Google News search results to a particular year. Ideally, I'd like to retrieve headlines, publication dates, and possibly summaries for news articles from a specific year (such as 2020). With the code below, I can scrape current articles, but news from a specific year isn't returned. Even in the Google News search box, the date filter only covers the past year, although when I scroll down I can see articles from 2013 and 2017. Could someone provide a Python script or pointers on how to solve this?

Here's what I've attempted so far:

import feedparser
import pandas as pd
from datetime import datetime

class GoogleNewsFeedScraper:
    def __init__(self, query):
        self.query = query

    def scrape_google_news_feed(self):
        formatted_query = '%20'.join(self.query.split())
        rss_url = f'https://news.google.com/rss/search?q={formatted_query}&hl=en-IN&gl=IN&ceid=IN%3Aen'
        feed = feedparser.parse(rss_url)
        titles = []
        links = []
        pubdates = []

        if feed.entries:
            for entry in feed.entries:
                # Title
                title = entry.title
                titles.append(title)
                # URL link
                link = entry.link
                links.append(link)
                # Date
                pubdate = entry.published
                date_str = str(pubdate)
                date_obj = datetime.strptime(date_str, "%a, %d %b %Y %H:%M:%S %Z")
                formatted_date = date_obj.strftime("%Y-%m-%d")
                pubdates.append(formatted_date)

        else:
            print("Nothing Found!")

        data = {'URL link': links, 'Title': titles, 'Date': pubdates}
        return data

    def convert_data_to_csv(self):
        d1 = self.scrape_google_news_feed()
        df = pd.DataFrame(d1)
        csv_name = self.query + ".csv"
        csv_name_new = csv_name.replace(" ", "_")
        df.to_csv(csv_name_new, index=False)


if __name__ == "__main__":
    query = 'forex rate news'
    scraper = GoogleNewsFeedScraper(query)
    scraper.convert_data_to_csv()

Solution

  • You can add date filters to your rss_url. Modify the query part to use the format below:

    Format: q=query+after:yyyy-mm-dd+before:yyyy-mm-dd

    Example: https://news.google.com/rss/search?q=forex%20rate%20news+after:2023-11-01+before:2023-12-01&hl=en-IN&gl=IN&ceid=IN:en

    The URL above returns articles related to forex rate news that were published between November 1st, 2023, and December 1st, 2023.

    Please refer to this article for more information.
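
To cover a whole year such as 2020, the same after:/before: format can be generated programmatically. The following is only a minimal sketch: build_rss_url and month_windows are hypothetical helper names, not part of feedparser or any Google API, and splitting the year into monthly windows is an assumption made here on the guess that a single feed request may not return every article for a long date range. Whether the before: date is inclusive or exclusive is not confirmed either, so check the dates at the window edges.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://news.google.com/rss/search"


def build_rss_url(query, after, before, hl="en-IN", gl="IN", ceid="IN:en"):
    """Hypothetical helper: build a search feed URL with after:/before: filters.

    The hl/gl/ceid defaults are copied from the URL in the question.
    """
    # The date filters are part of the search text itself, as in
    # q=query+after:yyyy-mm-dd+before:yyyy-mm-dd
    q = f"{query} after:{after.isoformat()} before:{before.isoformat()}"
    # urlencode turns spaces into '+' and ':' into '%3A'
    params = urlencode({"q": q, "hl": hl, "gl": gl, "ceid": ceid})
    return f"{BASE_URL}?{params}"


def month_windows(year):
    """Hypothetical helper: yield (start, end) dates for each month of a year.

    Each end date is the first day of the following month.
    """
    for month in range(1, 13):
        start = date(year, month, 1)
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        yield start, end


if __name__ == "__main__":
    # Print one feed URL per month of 2020 for the query from the question.
    for start, end in month_windows(2020):
        print(build_rss_url("forex rate news", start, end))
```

Each generated URL can be passed to feedparser.parse(rss_url) in place of the rss_url built in scrape_google_news_feed, appending the entries from every window to the same lists. Adjacent windows share a boundary date, so it may be worth dropping duplicate links before writing the CSV.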