python proxy mechanize file-exists

How to check if a remote file exists behind a proxy


I am writing an app that connects to a web server (I own the server) and sends it information provided by the user; the server processes that information and sends the result back to the application. The time needed to process the results depends on the user request (from a few seconds to a few minutes).

I use an infinite loop to check whether the result file exists. (Maybe there is a smarter approach: I could estimate the maximum time a request can take and avoid using an infinite loop.)

The important part of the code looks like this:

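import time
import mechanize

br = mechanize.Browser()
br.set_handle_refresh(False)
# Route HTTP requests through the proxy.
proxy_values = {'http': 'proxy:1234'}
br.set_proxies(proxy_values)


# Poll until the result file can be read.
while True:
    try:
        result = br.open('http://www.example.com/sample.txt').read()
        break
    except Exception:
        # The file is not available yet; wait before asking again.
        time.sleep(10)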

Behind a proxy the loop never ends, but if I change the code to something like this,

time.sleep(200)
result=br.open('http://www.example.com/sample.txt').read()

i.e. I wait long enough to ensure that the file has been created before trying to read it, then I do get the file :-)

It seems that if mechanize asks for a file that does not exist yet, then every later request for that file also fails, even after the file has been created...

I reproduced the same behavior using Firefox: if I ask for a non-existent file and then create that file (remember, I own the server...), I cannot get the file. And with both mechanize and Firefox I can still get files that have been deleted...

I think the problem is related to the proxy cache. I don't think I can clear that cache, but maybe there is some way to tell the proxy that it needs to recheck whether the file exists...

Any other suggestions to fix this problem?


Solution

  • The simplest solution could be to add an (unused) GET parameter so that the request is not served from the cache.

    i.e.:

    i = 0
    while True:
        try:
            # A different value of r on each attempt makes every URL unique to the cache.
            result = br.open('http://www.example.com/sample.txt?r=%d' % i).read()
            break
        except Exception:
            i += 1
        time.sleep(10)
    

    The extra parameter should be ignored by the web application.

    An HTTP HEAD request is probably the correct way to do this; see this question for an example.
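
As a rough illustration of the two ideas above combined, here is a minimal sketch that polls with HEAD requests, changes the unused parameter on every attempt, and gives up after a deadline instead of looping forever. It swaps mechanize for the Python 3 standard library (urllib.request). The function names (cache_busting_url, remote_file_exists, wait_for_file), the parameter name r, and the timeout values are assumptions made for this example, not part of the original code. The Cache-Control: no-cache request header is an extra hint that a proxy may or may not honor.

```python
import time
import urllib.request


def cache_busting_url(url, token):
    """Append a throwaway query parameter so each attempt uses a unique URL."""
    separator = '&' if '?' in url else '?'
    return '%s%sr=%s' % (url, separator, token)


def remote_file_exists(opener, url):
    """Send an HTTP HEAD request; return True only for a 2xx response."""
    request = urllib.request.Request(url, method='HEAD')
    # Assumption: the proxy honors this header and revalidates with the server.
    request.add_header('Cache-Control', 'no-cache')
    try:
        with opener.open(request, timeout=30) as response:
            return 200 <= response.status < 300
    except OSError:
        # HTTPError (e.g. 404), URLError and socket timeouts all derive from OSError.
        return False


def wait_for_file(url, exists, timeout=300, interval=10, sleep=time.sleep):
    """Poll exists(url) every `interval` seconds for roughly `timeout` seconds."""
    attempts = max(1, int(timeout // interval))
    for attempt in range(attempts):
        if exists(cache_busting_url(url, attempt)):
            return True
        sleep(interval)
    return False
```

To use it behind the proxy from the question, one could build an opener with urllib.request.build_opener(urllib.request.ProxyHandler({'http': 'http://proxy:1234'})) and call wait_for_file('http://www.example.com/sample.txt', lambda u: remote_file_exists(opener, u)). Once it returns True, the file can be fetched with a normal GET. Passing the existence check in as a function keeps the polling loop independent of the HTTP library, so the mechanize browser could be plugged in the same way.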