I need to scrape this website: https://www.rbi.org.in/scripts/NewLinkDetails.aspx
It contains news from the central bank of India. We need to use Playwright for Python and asyncio.
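For reference, a typical setup (assuming pip and the bundled Chromium browser) is:

    pip install playwright
    playwright install chromium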
The HTML pattern of this page is the following.
Each of these links contains the URL we need to visit to get the news (a small sketch for collecting them follows the snippet):
<a class="link2" href="https://www.rbi.org.in/Scripts/BS_PressReleaseDisplay.aspx?prid=56032">Governor, Reserve Bank of India meets MD & CEOs of Public and Private Sector Banks</a>
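As a rough sketch (assuming `page` is an open Playwright page on the listing), all of those hrefs can be collected as plain strings in one call:

    # Collect every href from the .link2 anchors as plain strings,
    # so nothing stays tied to the live DOM after navigating away.
    hrefs = await page.eval_on_selector_all('.link2', 'els => els.map(el => el.href)')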
For example, if we go to https://www.rbi.org.in/Scripts/BS_PressReleaseDisplay.aspx?prid=56032, the HTML structure is the following.
The tableheader cell holds the news title, which we need to get (see the sketch after the snippet):
<td align="center" class="tableheader"><b>Governor, Reserve Bank of India meets MD & CEOs of Public and Private Sector Banks</b></td>
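A minimal sketch of reading that title (assuming the page has already navigated to the article):

    # The title is the bold text inside the centered tableheader cell
    title = await page.locator('td.tableheader[align="center"] b').inner_text()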
From this HTML pattern we need to get the date (a parsing sketch follows the snippet):
<td align="right" class="tableheader"><b> Date : </b>Jul 11, 2023</td>
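A minimal sketch of pulling the date out and parsing it (the selector and the "Date :" label handling are assumptions based on the snippet above):

    from datetime import datetime

    # The visible cell text is "Date : Jul 11, 2023", so strip the label first
    raw = await page.locator('td.tableheader[align="right"]').inner_text()
    news_date = datetime.strptime(raw.replace('Date :', '').strip(), '%b %d, %Y')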
From this HTML pattern we can extract the news content. Each p tag contains part of the news content, so we need to get all p tags from each article URL (see the sketch after the snippet):
<tr class="tablecontent1"><td><table width="100%" border="0" align="center" class="td"> <tbody><tr>
<td><p>The Governor, Reserve Bank of India held meetings with the MD & CEOs of Public Sector Banks and select Private Sector Banks on July 11, 2023 at Mumbai.
The meetings were also attended by Deputy Governors, Shri M. Rajeshwar Rao and Shri Swaminathan J., along with a few senior officials of the RBI. </p>
<p>The Governor in his introductory remarks, while noting the good performance of the Indian banking system despite various adverse global developments.</p>
<p>The issues relating to strengthening of credit underwriting standards, monitoring of large exposures, implementation of External Benchmark Linked Rate (EBLR) Guidelines,
bolstering IT security and IT governance, improving recovery from written-off accounts, and timely and accurate sharing of information with Credit Information Companies
were discussed.</p>
<p align="right"><span class="head">(Yogesh Dayal) </span><br> Chief General Manager</p>
<p class="head">Press Release: 2023-2024/582</p></td> </tr></tbody></table></td> </tr>
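A small sketch of gathering those paragraphs with a CSS locator (same assumption that `page` is already on the article):

    # Every <p> inside the tablecontent1 row is part of the article body
    paragraphs = await page.locator('tr.tablecontent1 p').all_inner_texts()
    content = '\n\n'.join(paragraphs)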
I am using this code:
import asyncio
from playwright.async_api import async_playwright

async def scrape_rbi_news():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto('https://www.rbi.org.in/scripts/NewLinkDetails.aspx')

        # Wait for the page to load and display the links
        await page.wait_for_selector('.link2')

        # Get all news links
        news_links = await page.query_selector_all('.link2')

        # Get the first 10 news links
        top_10_links = news_links[:10]

        for link in top_10_links:
            link_url = await link.get_attribute('href')

            # Open each news link
            await page.goto(link_url)
            await asyncio.sleep(2)  # Add a delay of 2 seconds for the page to load

            try:
                # Wait for the title and date elements to be attached to the DOM
                await page.wait_for_selector('.tableheader b', timeout=5000)
                await page.wait_for_selector('.tableheader b:has-text(" Date : ")', timeout=5000)

                # Extract news date using JavaScript evaluation
                news_date_element = await page.query_selector('.tableheader b:has-text(" Date : ")')
                news_date = await news_date_element.evaluate('(element) => element.nextSibling.textContent')

                # Extract news content
                news_content_elements = await page.query_selector_all('.tablecontent1 p')
                news_content = '\n'.join([await element.inner_text() for element in news_content_elements])

                # Print extracted data for each news article
                print('URL:', link_url)
                print('Date:', news_date.strip())
                print('Content:', news_content)
                print('---')
            except Exception as e:
                print('Error:', str(e))

        await browser.close()

# Run the scraping function
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(scrape_rbi_news())
It prints the first news item correctly, but after that it breaks. I see this error:
playwright._impl._api_types.Error: Element is not attached to the DOM
Any suggestions on how to solve this issue?
Your problem is in the line link_url = await link.get_attribute('href').
You start on the index page, get the href attribute of the first link, and navigate to that URL.
When you are on the news page, you try to run link_url = await link.get_attribute('href') again,
but that element is not in the page anymore, so you cannot get the href of an element that no longer exists.
You should save the links into a list before starting the loop.
Here is your script after that change (I wrote my own selectors: I am not very familiar with CSS, so I used XPath).
import asyncio
from playwright.async_api import async_playwright

async def scrape_rbi_news():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto('https://www.rbi.org.in/scripts/NewLinkDetails.aspx')

        # Wait for the page to load and display the links
        await page.wait_for_selector('.link2')

        # Get all news links
        news_links = await page.locator('.link2').all()

        # Get the first 10 news links
        top_10_links = news_links[:10]

        # Save the links as text instead of element handles, to avoid the
        # detached-element problem described above
        links = []
        for link_element in top_10_links:
            links.append(await link_element.get_attribute('href'))

        for link in links:
            # Open each news link
            await page.goto(link)
            await asyncio.sleep(2)  # Add a delay of 2 seconds for the page to load

            try:
                # XPath indices chosen to match the article page layout
                date = await page.locator("(//td[@class='tableheader'])[2]").inner_text()
                title = await page.locator("(//td[@class='tableheader']/b)[2]").inner_text()
                content = await page.locator("//tr[@class='tablecontent1']//p").all_inner_texts()
                content = '\n\n'.join(content)

                print('URL:', link)
                print(date)
                print(title)
                print('Content:', content)
                print('---')
            except Exception as e:
                print('Error:', str(e))

        await browser.close()

# Run the scraping function
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(scrape_rbi_news())
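Two small improvements you could also consider: the fixed asyncio.sleep(2) can usually be replaced by waiting on the page itself, and asyncio.run(scrape_rbi_news()) is the more modern entry point than driving the event loop manually. A sketch of the first change:

    # Wait for the navigation and for the article table instead of sleeping
    await page.goto(link, wait_until='domcontentloaded')
    await page.wait_for_selector('.tableheader')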