Tags: python, web-scraping, playwright

Web Scraping News Articles with Python


I need to scrape this website: https://www.rbi.org.in/scripts/NewLinkDetails.aspx

This website contains news from the central bank of India. We need to use Playwright for Python with asyncio.

The HTML pattern of this page is as follows.

Each of these links contains a URL that we need to visit to fetch the news:

<a class="link2" href="https://www.rbi.org.in/Scripts/BS_PressReleaseDisplay.aspx?prid=56032">Governor, Reserve Bank of India meets MD &amp; CEOs of Public and Private Sector Banks</a>

For example, if we go to https://www.rbi.org.in/Scripts/BS_PressReleaseDisplay.aspx?prid=56032, the HTML structure is as follows.

Here the tableheader cell holds the news title, which we need to extract:

<td align="center" class="tableheader"><b>Governor, Reserve Bank of India meets MD &amp; CEOs of Public and Private Sector Banks</b></td>

From this HTML pattern we need to get the date:

<td align="right" class="tableheader"><b> Date : </b>Jul 11, 2023</td>

From this HTML pattern we can extract the news content. Each p tag holds part of the article, so we need to collect every p from each article URL:

<tr class="tablecontent1"><td><table width="100%" border="0" align="center" class="td">  <tbody><tr>    
<td><p>The Governor, Reserve Bank of India held meetings with the MD &amp; CEOs of Public Sector Banks and select Private Sector Banks on July 11, 2023 at Mumbai. 
The meetings were also attended by Deputy Governors, Shri M. Rajeshwar Rao and Shri Swaminathan J., along with a few senior officials of the RBI. </p>     
<p>The Governor in his introductory remarks, while noting the good performance of the Indian banking system despite various adverse global developments.</p>    
 <p>The issues relating to strengthening of credit underwriting standards, monitoring of large exposures, implementation of External Benchmark Linked Rate (EBLR) Guidelines,
 bolstering IT security and IT governance, improving recovery from written-off accounts, and timely and accurate sharing of information with Credit Information Companies 
 were discussed.</p>     
 <p align="right"><span class="head">(Yogesh Dayal)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><br>      Chief General Manager</p>    
 <p class="head">Press Release: 2023-2024/582</p></td>  </tr></tbody></table></td> </tr>

I am using this code:

import asyncio
from playwright.async_api import async_playwright

async def scrape_rbi_news():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()

        page = await context.new_page()
        await page.goto('https://www.rbi.org.in/scripts/NewLinkDetails.aspx')

        # Wait for the page to load and display the links
        await page.wait_for_selector('.link2')

        # Get all news links
        news_links = await page.query_selector_all('.link2')

        # Get the first 10 news links
        top_10_links = news_links[:10]

        for link in top_10_links:
            link_url = await link.get_attribute('href')

            # Open each news link
            await page.goto(link_url)
            await asyncio.sleep(2)  # Add a delay of 2 seconds for the page to load

            try:
                # Wait for the title and date elements to be attached to the DOM
                await page.wait_for_selector('.tableheader b', timeout=5000)
                await page.wait_for_selector('.tableheader b:has-text(" Date : ")', timeout=5000)

                # Extract news date using JavaScript evaluation
                news_date_element = await page.query_selector('.tableheader b:has-text(" Date : ")')
                news_date = await news_date_element.evaluate('(element) => element.nextSibling.textContent')

                # Extract news content
                news_content_elements = await page.query_selector_all('.tablecontent1 p')
                news_content = '\n'.join([await element.inner_text() for element in news_content_elements])

                # Print extracted data for each news article
                print('URL:', link_url)
                print('Date:', news_date.strip())
                print('Content:', news_content)
                print('---')
            except Exception as e:
                print('Error:', str(e))

        await browser.close()

# Run the scraping function
if __name__ == '__main__':
    asyncio.run(scrape_rbi_news())

It prints the first news item correctly. After that it breaks, and I see this error:

playwright._impl._api_types.Error: Element is not attached to the DOM

Any suggestions on how to solve this issue?


Solution

  • Your problem is in the line link_url = await link.get_attribute('href')

    You are on the index page when you collect the link elements; you read the href of the first one and navigate to that URL.

    Once you are on the news page, the loop calls link_url = await link.get_attribute('href') again, but that element is no longer attached to the DOM, so you cannot read the href of an element that no longer exists on the page.

    You should save the links into a list of strings before starting the loop.

    Here is your script after that change (I wrote my own selectors; since I am not very familiar with CSS, I used XPath):

    import asyncio
    from playwright.async_api import async_playwright
    
    async def scrape_rbi_news():
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            context = await browser.new_context()
    
            page = await context.new_page()
            await page.goto('https://www.rbi.org.in/scripts/NewLinkDetails.aspx')
    
            # Wait for the page to load and display the links
            await page.wait_for_selector('.link2')
    
            # Get all news links
            news_links = await page.locator('.link2').all()
    
            # Get the first 10 news links
            top_10_links = news_links[:10]
            links = []
            # Save each link href as a string, instead of an element handle, to avoid the stale-element problem described above
            for link_element in top_10_links:
                links.append(await link_element.get_attribute('href'))
    
            for link in links:
                # Open each news link
                await page.goto(link)
                await asyncio.sleep(2)  # Add a delay of 2 seconds for the page to load
    
                try:
                    # Extract the date, title and content
                    date = await page.locator("(//td[@class='tableheader'])[2]").inner_text()
                    title = await page.locator("(//td[@class='tableheader']/b)[2]").inner_text()
                    content = await page.locator("//tr[@class='tablecontent1']//p").all_inner_texts()
                    content = '\n\n'.join(content)
    
                    print('URL:', link)
                    print(date)
                    print(title)
                    print('Content:', content)
                    print('---')
                except Exception as e:
                    print('Error:', str(e))
    
            await browser.close()
    
    # Run the scraping function
    if __name__ == '__main__':
        asyncio.run(scrape_rbi_news())
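
    A further option that sidesteps stale element handles entirely is to leave the index page untouched and open each article in its own page. A minimal sketch of that loop, reusing the selectors above and the links list already built:

    # Alternative loop body: one fresh page per article, so nothing on the index page detaches
    for link in links:
        article_page = await context.new_page()
        await article_page.goto(link)
        date = await article_page.locator("(//td[@class='tableheader'])[2]").inner_text()
        title = await article_page.locator("(//td[@class='tableheader']/b)[2]").inner_text()
        content = '\n\n'.join(await article_page.locator("//tr[@class='tablecontent1']//p").all_inner_texts())
        print('URL:', link)
        print(date)
        print(title)
        print('Content:', content)
        print('---')
        await article_page.close()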