I am attempting to scrape the NSE website for a particular company in Python, using the requests library and its get() method. This should return a .html document, which I can then use for further processing with Beautiful Soup 4.
import requests

url = 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'
response = requests.get(url)
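For context, the further processing I have in mind is roughly the following (just a sketch; the html.parser backend is an arbitrary choice):
from bs4 import BeautifulSoup

# Parse the returned HTML so individual elements can be extracted later
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)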
However, the requests.get() call did not return a response even after a few minutes, and instead it made my system crash. Following a suggestion from the question Python requests GET takes a long time to respond to some requests,
The server might only allow specific user-agent strings
I added a User-Agent string to the headers of the request as follows:
url = 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
}
response = requests.get(url, headers=headers)
This did not remedy the issue either, which makes it unlikely to be a performance problem. I then attempted a different approach mentioned in the aforementioned question,
IPv6 does not work, but IPv4 does
This was, in fact, the case: when I attempted the request in IPv6 mode I got an error.
$ curl --ipv6 -v 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'
* Could not resolve host: www.nseindia.com
* Closing connection
curl: (6) Could not resolve host: www.nseindia.com
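(As an aside, had forcing IPv4 been the fix, the usual way to mirror curl's --ipv4 from requests is to monkey-patch a urllib3 internal before making any requests; allowed_gai_family is not a public API, so this is version-dependent.)
import socket
import urllib3.util.connection as urllib3_connection

# Make urllib3 (and therefore requests) resolve IPv4 addresses only,
# mirroring curl's --ipv4 flag. This patches an internal helper and
# must run before any request is made.
urllib3_connection.allowed_gai_family = lambda: socket.AF_INET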
But IPv4 mode did not fare much better in this case either.
$ curl --ipv4 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS' # -v removed to focus on the error
curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
Clearly, the website also does not support the HTTP/2 protocol, and the request finally ran when the protocol was forced to HTTP/1.1.
curl --ipv4 -v 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS' --http1.1
* Host www.nseindia.com:443 was resolved.
* IPv6: (none)
* IPv4: 184.29.25.143
* Trying 184.29.25.143:443...
* Connected to www.nseindia.com (184.29.25.143) port 443
* ALPN: curl offers http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: none
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / x25519 / RSASSA-PSS
* ALPN: server accepted http/1.1
* Server certificate:
* subject: C=IN; ST=Maharashtra; L=Mumbai; O=National Stock Exchange of India Ltd.; CN=www.nseindia.com
* start date: May 28 00:00:00 2024 GMT
* expire date: May 22 23:59:59 2025 GMT
* subjectAltName: host "www.nseindia.com" matched cert's "www.nseindia.com"
* issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=GeoTrust RSA CA 2018
* SSL certificate verify ok.
* Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* Certificate level 2: Public key type RSA (2048/112 Bits/secBits), signed using sha1WithRSAEncryption
* using HTTP/1.x
> GET /get-quotes/equity?symbol=20MICRONS HTTP/1.1
> Host: www.nseindia.com
> User-Agent: curl/8.8.0
> Accept: */*
>
* Request completely sent off
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
However, the cURL request still does not complete. I then looked at questions concerning the NSE India website itself, and found the question, Python Requests get returns response code 401 for nse india website,
To access the NSE (api's) site multiple times then set cookies in each subsequent requests
essentially recommending adding cookies to the request:
import requests

baseurl = 'https://www.nseindia.com/'
url = 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'accept-language': 'en,gu;q=0.9,hi;q=0.8',
    'accept-encoding': 'gzip, deflate, br'
}

session = requests.Session()
# Hit the home page first so the server sets its cookies on the session
request = session.get(baseurl, headers=headers, timeout=5)
cookies = dict(request.cookies)
# Pass the collected cookies along with the actual page request
response = session.get(url, headers=headers, timeout=5, cookies=cookies)
This solution still faced the same issue: the request would simply never complete. That question was asked and answered for the NSE India API, so it makes sense that its fix does not carry over here. I also checked by adding the Accept header, according to @GTK's comment,
you need the accept header [...]
import requests

url = 'https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    "Accept": "text/html"
}
response = requests.get(url, headers=headers)
Sadly, this does not affect the problem, perhaps because requests already sets the Accept header to */* by default, which allows any MIME type.
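(This default can be seen by inspecting the headers requests attaches to a fresh session; the exact values depend on the installed version.)
import requests

# requests' default session headers already include Accept: */*
print(requests.Session().headers)
# roughly: {'User-Agent': 'python-requests/2.x', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}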
How can I proceed in this situation, and why is this happening? Is the request truly that slow, or is some error occurring?
I was able to get a response from the server with a different User-Agent header (the server most probably blacklists some specific user agents):
import requests
url = "https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
}
response = requests.get(url, headers=headers)
print(response.text)
Prints:
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0" />
<title>
20 Microns Limited Share Price Today, Stock Price, Live NSE News, Quotes, Tips – NSE India
</title>
...
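If you need to hit the site more than once (as with the session approach in your question), the same User-Agent can be set once on a requests.Session. This is only a sketch; the 10-second timeout is arbitrary and the site's protections may change over time:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
}

with requests.Session() as session:
    session.headers.update(headers)  # every request made through this session carries the header
    response = session.get(
        "https://www.nseindia.com/get-quotes/equity?symbol=20MICRONS",
        timeout=10,  # raise instead of hanging indefinitely if the server stalls
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of silently returning an error page
    print(response.status_code, len(response.text))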