Tags: web-scraping, cookies, python-requests, web-crawler

How to deal with Dynamic cookies when web crawling


I am trying to access data (a quote value) from an e-commerce website using the 'requests' library in Python. The problem I have is that the cookies on the website are dynamic, and my code requires a header to get a response. I can open the website and scrape it, but to do that I need to copy the header details from my browser. Since I need to automate this process, I don't want to manually paste the cookie in every time I want to scrape. This is the link: "https://www.nseindia.com/get-quotes/equity?symbol=RELIANCE". I am trying to get the 'Intraday chart' data so I can store it in a DataFrame and plot it.

I am a beginner and I have never web scraped before.

This is what I have tried so far.

import requests
import pandas as pd

# I copied these values from the request headers in my browser's developer tools

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Accept-Language': 'en-US,en;q=0.5',
    'Cookie': 'Cookie Value',  # pasted manually from the browser each time
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

response = requests.get(url = 'https://www.nseindia.com/api/chart-databyindex?index=RELIANCEEQN', headers = headers)

# Note: the key really is spelled 'grapthData' in the API response
reliance = pd.DataFrame(response.json()['grapthData'])

reliance.columns = ['Timestamp', 'Price']

reliance['Timestamp'] = pd.to_datetime(reliance['Timestamp'], unit = 'ms' )

reliance.plot(x = 'Timestamp', y = 'Price')

Solution

  • First approach

    To scrape data from a website with dynamic cookies, you'll need to handle cookies and headers automatically. One way to achieve this is by using the requests library along with requests.Session to manage and persist cookies across multiple requests. Additionally, you can use the BeautifulSoup library to parse the HTML content if necessary.

    Here's a more automated approach to handling dynamic cookies and headers:

    • Use requests.Session to maintain a session.
    • Perform an initial request to get the dynamic cookies.
    • Use these cookies to make the subsequent requests.

    Example:

    import requests
    
    # Initialize a session to handle cookies
    session = requests.Session()
    
    # NSE is known to reject requests without browser-like headers, so set
    # them once on the session; they are then sent with every request
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
    })
    
    # Initial request to get the dynamic cookies
    url = 'https://www.nseindia.com/get-quotes/equity?symbol=RELIANCE'
    initial_response = session.get(url)
    
    # The session now carries the cookies set above, so the API request
    # succeeds without a hand-copied Cookie header
    data_url = 'https://www.nseindia.com/api/chart-databyindex?index=RELIANCEEQN'
    response = session.get(data_url)
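Once the session carries valid cookies, the JSON payload can be turned into a DataFrame much as in the original attempt. The snippet below sketches that step against a hand-made sample payload standing in for `response.json()`; the key spelling 'grapthData' is taken from the question's code, and each entry appears to be an [epoch-milliseconds, price] pair:

```python
import pandas as pd

# Hand-made sample shaped like the API response; in the real script this
# would be response.json() from the session request above
payload = {'grapthData': [[1715587200000, 2840.5],
                          [1715587260000, 2841.0]]}

reliance = pd.DataFrame(payload['grapthData'], columns=['Timestamp', 'Price'])
reliance['Timestamp'] = pd.to_datetime(reliance['Timestamp'], unit='ms')
print(reliance)
```

From here, `reliance.plot(x='Timestamp', y='Price')` reproduces the plot from the question.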
       
    

  • Second approach

    Using Selenium to automate the browser and extract cookies can be an effective approach to handle dynamic cookies. Here’s how you can use Selenium to open the webpage, retrieve the cookies, and then use these cookies in the requests library to fetch the required data.

    pip install requests pandas selenium
    

    Example:

    from selenium import webdriver
    import requests
    import pandas as pd
    import time
    
    
    # Initialize the Selenium WebDriver
    driver = webdriver.Chrome()
    
    # Open the URL using Selenium
    url = 'https://www.nseindia.com/get-quotes/equity?symbol=RELIANCE'
    driver.get(url)
    time.sleep(5)  # give the page time to load and set its cookies
    
    # Extract cookies from the Selenium browser session
    cookies = driver.get_cookies()
    
    # Close the Selenium browser
    driver.quit()
    
    # Create a dictionary of cookies for the requests session
    cookies_dict = {cookie['name']: cookie['value'] for cookie in cookies}
    
    # Initialize a session to handle cookies
    session = requests.Session()
    
    # Set the headers for the session
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate, br, zstd',
        'Accept-Language': 'en-US,en;q=0.5',
        'X-Requested-With': 'XMLHttpRequest'
    })
    
    # Set the cookies for the session
    session.cookies.update(cookies_dict)
    
    # Make the actual request to get the intraday chart data
    data_url = 'https://www.nseindia.com/api/chart-databyindex?index=RELIANCEEQN'
    response = session.get(data_url)
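To make the hand-off from Selenium to requests concrete: `get_cookies()` returns a list of dicts (each with 'name', 'value', 'domain', and so on), and the dict comprehension above flattens them into the plain name-to-value mapping requests expects. The cookie names below are made up for illustration, not NSE's actual names:

```python
# Selenium's get_cookies() returns a list of dicts like these
# (cookie names here are illustrative only)
cookies = [
    {'name': 'session_id', 'value': 'abc123', 'domain': '.nseindia.com', 'path': '/'},
    {'name': 'app_token', 'value': 'xyz789', 'domain': '.nseindia.com', 'path': '/'},
]

# Keep only name/value; requests does not need the other fields
cookies_dict = {cookie['name']: cookie['value'] for cookie in cookies}
print(cookies_dict)  # {'session_id': 'abc123', 'app_token': 'xyz789'}
```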
    

    Explanation:

    1. Selenium Setup: Initialize a Selenium WebDriver instance with Chrome.
    2. Open URL: Navigate to the target URL using Selenium.
    3. Extract Cookies: Retrieve cookies from the Selenium session.
    4. Session Management: Use requests.Session to maintain session state, including the cookies extracted by Selenium.
    5. Data Request: Make the request to the API endpoint using the session with the correct headers and cookies.

    This approach combines the automation capabilities of Selenium with the simplicity and efficiency of the requests library for making HTTP requests, ensuring you can handle dynamic cookies without manual intervention.
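One caveat for either approach: the site's cookies expire after a while, so a long-running script should detect a stale session and redo the handshake. The refresh rule below is a heuristic of my own, not documented NSE behaviour; it treats a 401/403 status or an empty body as a sign the cookies have gone stale:

```python
def needs_refresh(status_code: int, body: str) -> bool:
    """Heuristic: redo the cookie handshake when the API rejects the
    request (401/403) or silently returns an empty body."""
    return status_code in (401, 403) or not body

# Usage sketch (wiring around the sessions shown above):
# response = session.get(data_url)
# if needs_refresh(response.status_code, response.text):
#     session.get(url)            # re-run the handshake for fresh cookies
#     response = session.get(data_url)

print(needs_refresh(401, ''))    # True
print(needs_refresh(200, '{}'))  # False
```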