I am trying to request batch log-level data from the AppNexus API. Based on the official Data Service guide, there are four main steps:
1. Account authentication -> returns a token in JSON
2. GET the list of available data feeds and look up the download parameters -> returns parameters in JSON
3. GET the file download location code by passing the download parameters -> extract the location code from the response header
4. GET the log data file by passing the location code -> returns a gz data file
Those steps work perfectly in Terminal using curl:
curl -b cookies -c cookies -X POST -d @auth 'https://api.appnexus.com/auth'
curl -b cookies -c cookies 'https://api.appnexus.com/siphon?siphon_name=standard_feed'
curl --verbose -b cookies -c cookies 'https://api.appnexus.com/siphon-download?siphon_name=standard_feed&hour=2017_12_28_09&timestamp=20171228111358&member_id=311&split_part=0'
curl -b cookies -c cookies 'http://data-api-gslb.adnxs.net/siphon-download/[location code]' > ./data_download/log_level_feed.gz
In Python, I tried the same thing to test the API. However, it keeps giving me a "ConnectionError". Steps 1-2 still work well: I successfully got the parameters from the JSON response and used them to build the URL for step 3, where I need to request the location code and extract it from the response header.
Step 1:
# Step 1
############ Authentication ###########################
import json
import requests

# Select end-point
auth_endpoint = 'https://api.appnexus.com/auth'
# Credentials (username/password)
auth_app = json.dumps({'auth': {'username': 'xxxxxxx', 'password': 'xxxxxxx'}})
# Proxy
proxy = {'https': 'https://proxy.xxxxxx.net:xxxxx'}
r = requests.post(auth_endpoint, proxies=proxy, data=auth_app)
data = json.loads(r.text)
token = data['response']['token']
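As a side note, a quick check on the auth response makes failures easier to spot before step 2; a minimal sketch, assuming the usual response.status == 'OK' success marker in the JSON body:

# Sanity-check the auth call before moving on (sketch)
if r.status_code != 200:
    raise RuntimeError('Auth failed with HTTP ' + str(r.status_code))
if data['response'].get('status') != 'OK':  # 'OK' is assumed to be the success marker
    raise RuntimeError('Auth rejected: ' + r.text)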
Step 2:
# Step 2
########### Check report list ###################################
check_list_endpoint = 'https://api.appnexus.com/siphon?siphon_name=standard_feed'
report_list = requests.get(check_list_endpoint, proxies=proxy, headers={"Authorization": token})
data = json.loads(report_list.text)
print(str(len(data['response']['siphons'])) + ' previous hours available for download')

# Build URL for a single report - extract the parameters
download_endpoint = 'https://api.appnexus.com/siphon-download'
siphon_name = 'siphon_name=standard_feed'
hour = 'hour=' + data['response']['siphons'][400]['hour']
timestamp = 'timestamp=' + data['response']['siphons'][400]['timestamp']
member_id = 'member_id=311'
split_part = 'split_part=' + data['response']['siphons'][400]['splits'][0]['part']

# Build the URL
download_endpoint_url = download_endpoint + '?' + \
                        siphon_name + '&' + \
                        hour + '&' + \
                        timestamp + '&' + \
                        member_id + '&' + \
                        split_part

# Check
print(download_endpoint_url)
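As an aside, requests can build the query string itself through its params argument, which avoids the manual concatenation above; an equivalent sketch using only the values already extracted in this step:

# Equivalent request letting requests URL-encode and join the parameters (sketch)
siphon = data['response']['siphons'][400]
params = {
    'siphon_name': 'standard_feed',
    'hour': siphon['hour'],
    'timestamp': siphon['timestamp'],
    'member_id': 311,
    'split_part': siphon['splits'][0]['part'],
}
# requests turns this into ?siphon_name=...&hour=...&timestamp=...&member_id=...&split_part=...
resp = requests.get(download_endpoint, params=params, proxies=proxy,
                    headers={"Authorization": token})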
Yet, instead of running to completion, the "requests.get" in the following step 3 keeps raising a "ConnectionError". In addition, I found that the "location code" actually appears in the error message, right after "/siphon-download/". So I used "try..except" to extract it from the error message and keep the code running.
Step 3:
# Step 3
######### Extract location code for target report ####################
import re

try:
    TT = requests.get(download_endpoint_url, proxies=proxy,
                      headers={"Authorization": token}, timeout=1)
except requests.exceptions.ConnectionError as e:
    text = e.args[0].args[0]
    m = re.search('/siphon-download/(.+?) ', text)
    if m:
        location = m.group(1)
        print('Successfully extracted location: ' + location)
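For reference, since the guide says the location code arrives in a response header (step 3 above), the redirect target can also be read directly by disabling automatic redirects, instead of parsing the exception text; a sketch, assuming the endpoint answers with a 3xx response carrying a Location header:

# Read the redirect target from the Location header instead of the exception text (sketch)
resp = requests.get(download_endpoint_url, proxies=proxy,
                    headers={"Authorization": token},
                    allow_redirects=False)   # do not follow the 3xx automatically
if resp.is_redirect:
    location_url = resp.headers['Location']  # e.g. http://data-api-gslb.adnxs.net/siphon-download/...
    print('Redirected to: ' + location_url)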
Original error message without "try..except" in step 3:
ConnectionError: HTTPConnectionPool(host='data-api-gslb.adnxs.net', port=80): Max retries exceeded with url:
/siphon-download/dbvjhadfaslkdfa346583
(Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0000000007CBC7B8>:
Failed to establish a new connection: [Errno 10060] A connection attempt failed because the connected party did not
properly respond after a period of time, or established connection failed because connected host has failed to respond',))
Then I tried to make the last GET request with the location code extracted from the previous error message, to download the gz data file as I did with "curl" in the terminal. However, I got the same error message - ConnectionError.
Step 4:
# Step 4
######## Download data file #######################
extraction_location = 'http://data-api-gslb.adnxs.net/siphon-download/' + location
LLD = requests.get(extraction_location, proxies=proxy, headers={"Authorization":token}, timeout=1)
Original error message in step 4:
ConnectionError: HTTPConnectionPool(host='data-api-gslb.adnxs.net', port=80): Max retries exceeded with url:
/siphon-download/dbvjhadfaslkdfa346583
(Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0000000007BE15C0>:
Failed to establish a new connection: [Errno 10060] A connection attempt failed because the connected party did not
properly respond after a period of time, or established connection failed because connected host has failed to respond',))
To double-check, I tested all the endpoints, parameters, and the location code generated by my Python script in the terminal using curl. They all work fine, and the downloaded data is correct. Can anybody help me solve this issue in Python, or point me in the right direction to discover why this is happening? Many thanks!
1) In curl you are reading and writing cookies (-b cookies -c cookies). With requests you are not using a session object (http://docs.python-requests.org/en/master/user/advanced/#session-objects), so your cookie data is lost.
2) You define an https proxy and then try to connect over http with no proxy (to data-api-gslb.adnxs.net). Define both http and https proxies, but only once, on the session object. See http://docs.python-requests.org/en/master/user/advanced/#proxies. (This is probably the root cause of the error message you see.)
3) Requests handles redirects automatically, so there is no need to extract the Location header and use it in the next request; it will be followed for you. So there are 3 steps, not 4, once the other errors are fixed; see the end-to-end sketch after the session setup below. (This also answers Hetzroni's question in the comments above.)
So use

s = requests.Session()
s.proxies = {
    'http': 'http://proxy.xxxxxx.net:xxxxx',
    'https': 'https://proxy.xxxxxx.net:xxxxx',
}  # set this only once, using valid proxy URLs

and then call s.get() and s.post() instead of requests.get() and requests.post().
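Putting the three fixes together, here is a minimal end-to-end sketch of the corrected flow (three steps, since the download redirect is followed automatically). The proxy URLs and credentials are placeholders, and the siphon index and member_id are taken from the question:

import json
import requests

# One session: persists the auth cookie across calls (curl's -b/-c cookies)
s = requests.Session()
s.proxies = {
    'http': 'http://proxy.xxxxxx.net:xxxxx',   # placeholder proxy URLs
    'https': 'https://proxy.xxxxxx.net:xxxxx',
}

# Step 1: authenticate
auth_app = json.dumps({'auth': {'username': 'xxxxxxx', 'password': 'xxxxxxx'}})
r = s.post('https://api.appnexus.com/auth', data=auth_app)
token = r.json()['response']['token']
s.headers.update({'Authorization': token})  # send the token on every call

# Step 2: list the available feeds and pick the download parameters
feeds = s.get('https://api.appnexus.com/siphon',
              params={'siphon_name': 'standard_feed'}).json()
siphon = feeds['response']['siphons'][400]

# Step 3: request the file; the redirect to data-api-gslb.adnxs.net is
# followed automatically, so no manual Location handling is needed
lld = s.get('https://api.appnexus.com/siphon-download',
            params={'siphon_name': 'standard_feed',
                    'hour': siphon['hour'],
                    'timestamp': siphon['timestamp'],
                    'member_id': 311,
                    'split_part': siphon['splits'][0]['part']},
            stream=True)  # stream the gz body instead of loading it into memory

# Write the body to disk, like curl's "> ./data_download/log_level_feed.gz"
with open('./data_download/log_level_feed.gz', 'wb') as f:
    for chunk in lld.iter_content(chunk_size=8192):
        f.write(chunk)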