
Lambda Python Pool.map and urllib2.urlopen : Retry only failing processes, log only errors


I have an AWS Lambda function that calls a set of URLs using pool.map. The problem is that if any one of the URLs returns anything other than a 200, the Lambda function fails and is immediately retried, and the retry re-runs the ENTIRE Lambda function. I'd like it to retry only the failed URLs, and if they still fail after a second try, call a fixed URL to log an error.

This is the code as it currently sits (with some details removed); it works only when all of the URLs succeed:

from __future__ import print_function
import urllib2 
from multiprocessing.dummy import Pool as ThreadPool 

import hashlib
import datetime
import json

print('Loading function')

def lambda_handler(event, context):

  f = urllib2.urlopen("https://example.com/geturls/?action=something");
  data = json.loads(f.read());

  urls = [];
  for d in data:
      urls.append("https://"+d+".example.com/path/to/action");

  # Make the Pool of workers
  pool = ThreadPool(4);

  # Open the urls in their own threads
  # and return the results
  results = pool.map(urllib2.urlopen, urls);

  #close the pool and wait for the work to finish 
  pool.close();
  return pool.join();

I tried reading the official documentation, but it seems a bit thin on the map function, specifically on its return values.
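For reference, Pool.map from multiprocessing.dummy behaves like the built-in map: it blocks until every call has finished, returns the results as a list in the same order as the input, and re-raises in the caller any exception raised inside a worker, which is why a single bad URL takes down the whole call. A minimal, self-contained sketch with made-up values (nothing Lambda-specific):

from multiprocessing.dummy import Pool as ThreadPool

def square(n):
    if n == 3:
        raise ValueError("boom")          # a worker exception is re-raised by map()
    return n * n

pool = ThreadPool(4)
print(pool.map(square, [1, 2, 4]))        # [1, 4, 16] -- results in input order
try:
    pool.map(square, [1, 2, 3])           # the one failure aborts the whole map() call
except ValueError as e:
    print("map failed: %s" % e)
pool.close()
pool.join()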

Using the urlopen documentation I've tried modifying my code to the following:

from __future__ import print_function
import urllib2 
from multiprocessing.dummy import Pool as ThreadPool 

import hashlib
import datetime
import json

print('Loading function')

def lambda_handler(event, context):

  f = urllib2.urlopen("https://example.com/geturls/?action=something");
  data = json.loads(f.read());

  urls = [];
  for d in data:
      urls.append("https://"+d+".example.com/path/to/action");

  # Make the Pool of workers
  pool = ThreadPool(4);

  # Open the urls in their own threads
  # and return the results
  try:
     results = pool.map(urllib2.urlopen, urls);
  except URLError:
     try:                              # try once more before logging error
        urllib2.urlopen(URLError.url); # TODO: figure out which URL errored
     except URLError:                  # log error
        urllib2.urlopen("https://example.com/error/?url="+URLError.url);

  #close the pool and wait for the work to finish 
  pool.close();
  return True; # always return True so we never duplicate successful calls

I'm not sure whether handling exceptions that way is right, or whether I've even got the Python exception syntax correct. Again, my goal is to retry only the failed URLs, and if they still fail after a second try, call a fixed URL to log an error.
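For reference on the syntax question: in Python 2 these exception classes live on the urllib2 module and the instance is bound with as, so a per-call handler would normally look something like the sketch below (illustrative only; it is not the fix itself):

import urllib2

def fetch(url):
    try:
        return urllib2.urlopen(url)
    except urllib2.HTTPError as e:        # non-2xx responses (a subclass of URLError)
        print("HTTP %d for %s" % (e.code, url))
    except urllib2.URLError as e:         # DNS failures, refused connections, and so on
        print("failed to reach %s: %s" % (url, e.reason))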


Solution

  • I figured out the answer thanks to a "lower-level" look at this question I posted here.

    The answer was to create my own wrapper around the urllib2.urlopen function, since each thread's call needed its own try/except rather than wrapping the whole pool.map call. That function looks like this:

    def my_urlopen(url):
        try:
            return urllib2.urlopen(url)
        except urllib2.URLError:          # handle the failure for this URL only
            urllib2.urlopen("https://example.com/log_error/?url="+url)
            return None
    

    I put that above the def lambda_handler function declaration, and then inside the handler I can replace the whole try/except block, going from this:

    try:
       results = pool.map(urllib2.urlopen, urls);
    except URLError:
       try:                              # try once more before logging error
          urllib2.urlopen(URLError.url);
       except URLError:                  # log error
          urllib2.urlopen("https://example.com/error/?url="+URLError.url);
    

    To this:

    results = pool.map(my_urlopen, urls);
    

    Q.E.D.
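
    For completeness, the wrapper above logs on the first failure. If you also want the single retry before logging that the question describes, it could be extended along these lines (a sketch that reuses the same hypothetical example.com error endpoint and adds urllib.quote to encode the URL parameter):

    import urllib
    import urllib2

    def my_urlopen(url):
        # Try the URL twice; only after the second failure call the error endpoint.
        for attempt in range(2):
            try:
                return urllib2.urlopen(url)
            except urllib2.URLError:
                if attempt == 0:
                    continue                  # retry the failed URL once
                try:
                    urllib2.urlopen("https://example.com/error/?url=" + urllib.quote(url, safe=""))
                except urllib2.URLError:
                    pass                      # never let error logging kill the worker
                return None

    pool.map(my_urlopen, urls) then returns a list of response objects, with None in place of any URL that failed both attempts, so the handler can still tell which calls succeeded.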