
Perl, Children, and Shared Data


I'm working with a database that holds tens of thousands of URLs. I'm attempting to multi-thread a resolver that simply tries to resolve a given domain. On success, it compares the result to what's currently in the database, and updates the row if they differ. On failure, the row is also updated.

Naturally, this will produce an inordinate volume of database calls. Being fairly new to Perl, I'm unsure of the best way to achieve some form of asynchronous load distribution, so I have the following questions:

  1. What is the best option for distributing the workload? Why?
  2. How should I gather the URLs to resolve prior to spawning?
    • Creating a hash of domains with the data to be compared seems to make the most sense to me. Then split it up, fire up the children, and have the children return the changes to be made to the parent.
  3. How should returning data to the parent be handled in a clean manner?

I've been playing with a more Pythonic method (given that I have more experience in Python), but have yet to make it work, due to a lack of blocking for some reason. Aside from that issue, threading isn't the best option, simply due to the lack of CPU time available to each thread (plus, I've been crucified more than once in the Perl channel for using threads :P and for good reason).

Below is more-or-less pseudo-code that I've been playing with for my threads (which should be read as a supplement to my explanation of what I'm trying to accomplish, rather than anything else).

# Create children...
for (1 .. $threads_to_spawn) {
    threads->create(\&worker);
}

The parent then sits in a loop, monitoring a shared array of domains. It locks and re-populates it if it becomes empty.
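
The shared-array model described above can be sketched as follows. This is a hedged, self-contained illustration, not the asker's actual code: the domain list and worker count are made up, and the workers simply exit when the array drains rather than waiting for the parent to refill it.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Made-up data for illustration; the real list comes from the database.
my @domains :shared = qw(example.com example.org example.net);
my @changes :shared;

sub worker {
    while (1) {
        my $domain;
        {
            lock(@domains);           # one thread at a time may modify the array
            $domain = shift @domains;
        }
        last unless defined $domain;  # array drained; exit instead of spinning
        # ... resolve $domain here and compare against the DB row ...
        lock(@changes);
        push @changes, $domain;       # report back via another shared array
    }
}

threads->create(\&worker) for 1 .. 2;
$_->join() for threads->list();
print scalar(@changes), "\n";         # prints 3
```

Note that every access to a shared array must be wrapped in `lock()`, and the lock is released at the end of the enclosing block.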


Solution

  • Your code is the start of a persistent worker model.

    use threads;
    use Thread::Queue 3.01 qw( );    # end() requires 3.01+
    
    use constant NUM_WORKERS => 5;
    
    sub work {
       my ($dbh, $job) = @_;
       ...
    }
    
    {
       my $q = Thread::Queue->new();
    
       for (1..NUM_WORKERS) {
          async {
             my $dbh = ...;
             while (my $job = $q->dequeue()) {
                work($dbh, $job);
             }
          };
       }
    
       for my $job (...) {
          $q->enqueue($job);
       }
    
       $q->end();
       $_->join() for threads->list();
    }
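
To get results back to the parent cleanly (question 3 in the original post), one common pattern is a second queue that workers enqueue their findings onto. A hypothetical sketch, with made-up job names:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue 3.01;

# Hypothetical sketch: a second queue carries each worker's findings
# back to the parent.
my $jobs    = Thread::Queue->new();
my $results = Thread::Queue->new();

sub worker {
    while (defined(my $job = $jobs->dequeue())) {
        # ... resolve the domain and compare against the DB here ...
        $results->enqueue("$job: checked");
    }
}

my @workers = map { threads->create(\&worker) } 1 .. 2;

$jobs->enqueue($_) for qw(a.example b.example c.example);
$jobs->end();                 # workers see undef once the queue drains
$_->join() for @workers;
$results->end();

my @report;
while (defined(my $r = $results->dequeue())) {
    push @report, $r;
}
print scalar(@report), "\n";  # prints 3
```

Because `Thread::Queue` objects are already shared, no explicit locking is needed on either side.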
    

    Performance tips:

    • Tweak the number of workers for your system and workload.
    • Grouping small jobs into larger jobs can improve speed by reducing overhead.
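
For the second tip, a small hypothetical helper that chunks a URL list, so each queued job carries several URLs instead of one:

```perl
use strict;
use warnings;

# Hypothetical helper: split @items into array refs of at most $size
# elements, so one dequeue() hands a worker a whole chunk.
sub batch {
    my ($size, @items) = @_;
    my @batches;
    push @batches, [ splice @items, 0, $size ] while @items;
    return @batches;
}

my @urls    = map { "host$_.example" } 1 .. 10;
my @batches = batch(4, @urls);
print scalar(@batches), "\n";       # prints 3 (chunks of 4, 4, and 2)
```

Each batch (an array ref) can then be passed to `$q->enqueue(...)`, with `work()` iterating over the referenced list.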