python-3.x selenium drop-down-menu geckodriver

Python Selenium: data does not load (website security)


Below is the code I am using to try to download/scrape a CSV file. This is only the first stage, and in testing it already fails, even though no error is raised: the data does not load in the browser driven by geckodriver.

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

# Raw string so the backslashes in the Windows path are not treated as escape sequences
driver = webdriver.Firefox(executable_path=r"C:\Py378\prj14\geckodriver.exe")

driver.get("https://www.nseindia.com/market-data/live-equity-market")
time.sleep(5)

element_dropdown = Select(driver.find_element_by_id("equitieStockSelect"))
element_dropdown.select_by_index(44)   # Updated with help of @PDHide in the comments
time.sleep(5)

The code executes without errors, but the data for the selected option does not load, apparently because of the website's security settings. When I select the option manually in the automated browser, the table does not update either, as if no selection had been made. (Maybe the site detects that it is a Selenium driver and needs certain headers, but I am not sure.) Also, when I try to click on "Download in CSV", the request times out.


I need to download the CSV for F&O after the option has been selected successfully (as shown above). Please help.

I can browse the website in a normally installed browser, but when the same browser is driven by Python (Selenium) it fails. How can I bypass this security check?


Solution

  • I tried executing the code (using Chrome, but that shouldn't matter), or rather a slight variation of it, so I could better see what was going on. Note that I use implicitly_wait rather than sleep, since the latter wastes time. Here I am just trying to select the second option:

    from selenium import webdriver
    from selenium.webdriver.support.ui import Select
    
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options=options)
    
    try:
        driver.implicitly_wait(3) # wait up to 3 seconds before calls to find elements time out
        driver.get("https://www.nseindia.com/market-data/live-equity-market")
        select = Select(driver.find_element_by_id("equitieStockSelect"))
        select.select_by_index(1)
    finally:
        input('pausing...')
        driver.quit()
    

    As you can see, I have no problem selecting the second option. However, the new table is failing to load:

    At this point I manually reloaded the page and got the results below. My conclusion is that the website detects that the browser is being driven by automation and blocks access:

    Update

    The data can, however, be retrieved using requests. I used the Chrome inspector to watch the network XHR requests, selected the second option (NIFTY NEXT 50), and observed which AJAX request was made:


    In this case the URL was: https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%20NEXT%2050. However, you have to first fetch the initial page using a requests Session instance:

    import requests
    
    try:
        s = requests.Session()
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
        s.headers.update(headers)
        # You have to first retrieve the initial page:
        resp = s.get('https://www.nseindia.com/market-data/live-equity-market')
        resp.raise_for_status()
        #print(resp.text)
        resp = s.get('https://www.nseindia.com/api/equity-stockIndices?index=NIFTY%20NEXT%2050')
        resp.raise_for_status()
        data = resp.json()
        print(data)
    except Exception as e:
        print(e)
    

    Prints:

    {'name': 'NIFTY NEXT 50', 'advance': {'declines': '25', 'advances': '24', 'unchanged': '1'}, 'timestamp': '27-Nov-2020 16:00:00', 'data': [{'priority': 1, 'symbol': 'NIFTY NEXT 50', 'identifier': 'NIFTY NEXT 50', 'open': 30316.45,  etc. (data too long) }
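
Since the original goal was a CSV file, the list under the 'data' key can be written out with csv.DictWriter. This is only a minimal sketch: it assumes each entry of that list is a flat dict, as the truncated output above suggests. The helper name rows_to_csv and the second sample row are made up for illustration, and any nested value would simply be written as its string form.

```python
import csv
import io


def rows_to_csv(rows, fileobj):
    """Write a list of dicts to fileobj as CSV (hypothetical helper).

    The header is the union of all keys, in order of first appearance;
    keys missing from a row are written as empty cells.
    """
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)


# Sample shaped like the 'data' list above; the second row is invented for illustration
sample = [
    {'priority': 1, 'symbol': 'NIFTY NEXT 50', 'identifier': 'NIFTY NEXT 50', 'open': 30316.45},
    {'priority': 0, 'symbol': 'ABC', 'open': 100.0},
]
buffer = io.StringIO()
rows_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with newline='' (as the csv module documentation recommends) and pass data['data'] from the response as rows.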
    

    Update 2

    In general, to compute the URL for any index, for example index 44, look up the option value for that index (in this case 'SECURITIES IN F&O') and assign it to the variable option_value in the following program:

    from urllib.parse import quote_plus
    
    option_value = 'SECURITIES IN F&O'
    
    url = 'https://www.nseindia.com/api/equity-stockIndices?index=' + quote_plus(option_value)
    print(url)
    

    Prints:

    https://www.nseindia.com/api/equity-stockIndices?index=SECURITIES+IN+F%26O
    

    Use this URL in place of the NIFTY NEXT 50 URL in the requests example above.
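
The two updates can be combined into one small helper. The following is a sketch rather than a definitive implementation: the names index_url and fetch_index and the timeout value are assumptions introduced here, the user-agent string is copied from the example above, and the site may change its behaviour at any time.

```python
from urllib.parse import quote_plus

import requests

BASE = 'https://www.nseindia.com'
PAGE_URL = BASE + '/market-data/live-equity-market'
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36')


def index_url(option_value):
    """Build the API URL for a drop-down option value (hypothetical helper)."""
    return BASE + '/api/equity-stockIndices?index=' + quote_plus(option_value)


def fetch_index(option_value, timeout=10):
    """Fetch the JSON for one index (hypothetical helper).

    The initial page is requested first, in the same session, so that any
    cookies it sets are sent along with the API request.
    """
    with requests.Session() as s:
        s.headers.update({'user-agent': USER_AGENT})
        s.get(PAGE_URL, timeout=timeout).raise_for_status()
        resp = s.get(index_url(option_value), timeout=timeout)
        resp.raise_for_status()
        return resp.json()


# Example use (performs real network requests, so it is left commented out):
# data = fetch_index('SECURITIES IN F&O')
# print(data['name'])
```

Keeping the URL construction in its own function makes it easy to check the encoded URL without touching the network.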