python heroku beautifulsoup python-requests reverse-proxy

Weird bugs on Heroku production deployment with requests and proxies

I have built an application that uses proxies and the requests module in Python to check whether websites are indexed by Google. I scrape the Google results page, www.google.com/search?q=site:{url}&num=3, and look for the specific phrase that Google shows when it cannot find that site.

# checking logic
            response = self.proxy_request(INDEXING_SEARCH_STRING.format(current_url))
            if response.status_code != 200:
                return current_url, False, "failed"
            soup = bs4.BeautifulSoup(response.text, "html.parser")
            not_indexed_regex = re.compile("did not match any documents")
            if soup(text=not_indexed_regex):
                return current_url, False, "checked"
            else:
                print(response.text)
                return current_url, True, "checked"

# proxy requests
    def proxy_request(self, url, **kwargs):
        fail_count = 0
        max_failures = 3  # Adjust this threshold as needed
        print("Evaluating: ", self.url_manager.current_url_index, "URL: ", url)
        while fail_count < max_failures:
            current_proxy = self.proxy_manager.get_proxy_for_request()

            if current_proxy is None:
                # No usable proxy left: fall back to a direct request.
                ProgressManager.update_progress("All given proxy failed")
                return requests.get(url, **kwargs)

            try:
                response = requests.get(url, proxies=current_proxy, timeout=20)
                if response.status_code == 200:
                    print("Success!")
                    self.proxy_manager.update_proxy()
                    return response
                else:
                    # Count non-200 answers as failures so the loop always terminates.
                    print("Failed!", response.status_code)
                    ProgressManager.update_progress("Proxy failing with status code: " + str(response.status_code))
                    fail_count += 1
                    time.sleep(0.5)
                    self.proxy_manager.update_proxy()
            except Exception as e:
                print("Failed!", e)
                fail_count += 1
                self.proxy_manager.update_proxy()
                ProgressManager.update_progress(f"Request failed! {e.__class__.__name__}. ")
                # No `break` here: keep rotating proxies until max_failures is reached.
        # Every proxied attempt failed: wait, then retry without a proxy.
        time.sleep(5)
        return requests.get(url, timeout=20)
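
For reference, the retry behaviour that max_failures is meant to give can be sketched in isolation. This is only a minimal sketch, not the application's actual code: fetch_with_proxies, the fetch callable and the delay parameter are hypothetical names introduced for illustration. It relies on the fact that the exceptions raised by requests derive from OSError.

```python
import time


def fetch_with_proxies(url, proxies, fetch, max_failures=3, delay=0.0):
    """Try proxies in order; return the first 200 response, else None.

    `fetch(url, proxy)` is a hypothetical callable supplied by the caller,
    for example: lambda u, p: requests.get(u, proxies=p, timeout=20)
    """
    failures = 0
    for proxy in proxies:
        if failures >= max_failures:
            break  # give up only after max_failures failed attempts
        try:
            response = fetch(url, proxy)
        except OSError as exc:  # requests' exceptions derive from OSError
            print("Failed!", exc.__class__.__name__)
        else:
            if response.status_code == 200:
                return response
            print("Failed!", response.status_code)
        # Both exceptions and non-200 answers count as one failure.
        failures += 1
        time.sleep(delay)
    return None
```

Returning None leaves it to the caller to decide whether to fall back to a direct request, as proxy_request does after its loop.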

It works completely fine with or without proxies on my local machine. But when I deploy it on Heroku, it marks some sites as (True, "checked") even though they are not indexed; the same app running on my machine handles those sites correctly.

However, when no proxies are given, it works correctly; the bug arises only when proxies are supplied.

Also, if there is an easier way to avoid the H12 (request timeout) error for a long-running process that does not require running any additional servers, please let me know.

It works on localhost, so I am unable to debug the deployment effectively. Sometimes the proxy also fails with this error:

HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=site:{{URL}}/&num=1 (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 401 Auth Failed ip_blacklisted: 3.85.57.0/24')))

How can I resolve this?

Solution

I found out that Google returns the results page in different languages. Hence the English phrase "did not match any documents" that the pattern looks for may or may not appear.

A simple solution is to use a modified Google query, www.google.com/search?q=site:{url}&num=3&hl=en. The hl=en parameter forces Google to return the page in English.
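
A minimal sketch of the fix, assuming the query is built with the standard library and the page is checked with a plain substring test. build_search_url and is_indexed are hypothetical helper names, not part of the original app, and the check only makes sense once the page language is pinned to English.

```python
from urllib.parse import urlencode

# English phrase Google shows when a site: query has no results.
NOT_INDEXED_PHRASE = "did not match any documents"


def build_search_url(site, num=3, lang="en"):
    """Build a Google `site:` query URL with the interface language pinned."""
    params = {"q": f"site:{site}", "num": num, "hl": lang}
    return "https://www.google.com/search?" + urlencode(params)


def is_indexed(html):
    """Return False when the English 'no results' phrase appears in the page."""
    return NOT_INDEXED_PHRASE not in html
```

The URL returned by build_search_url can be passed to proxy_request unchanged. Because hl=en is part of the query itself, the response language should no longer depend on which proxy handled the request.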