Search code examples
phpmysqlqueuecycle

PHP + MySQL Cycle Queue


I just accepted a similar question (PHP + MySQL Queue), but I realized that it wasn't the correct question for my problem, but was the correct answer for my question :)

I have a MySQL (MyISAM type) table of sites to be scraped by workers.

CREATE TABLE `site` (
  `id` int(11) NOT NULL auto_increment,
  `url` text,
  `last_pop` int(13) default NULL,
  `md5` varchar(32) default NULL,
  `disabled` tinyint(1) default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `md5` (`md5`),
) ENGINE=MyISAM  DEFAULT CHARSET=latin1;

What I need is to scrap one site per worker without repeat. So, if I have 3 sites and 2 workers the system needs to work like this:

      ID URL   LAST_POP
t4    1  site1 t1         <- worker1 scrap site1
t4    2  site2 t2         <- worker2 scrap site2
t5    3  site3 t3         <- worker1 scrap site3
t6    1  site2 t4         <- worker2 scrap site2
t6    2  site1 t4         <- worker1 scrap site1
t7    3  site3 t5         <- worker2 scrap site3
....

It's like a cycled queue orderer by last_pop ASC.

How can I do that?


Solution

  • You'll probably want to keep track of two pieces of information for each site: when it was last scraped and if it is currently being scraped.

    Using the answer for your other question, set the scraping field to the id of the worker to lock it from other workers. When the worker is finished with the task set the scraping field back to null and the last_scrape date to the current time.

    CREATE TABLE `site` (
      `id` int(11) NOT NULL auto_increment,
      `url` text,
      `last_scrape` TIMESTAMP,
      `scraping` tinyint(1) default NULL,
      `md5` varchar(32) default NULL,
      `disabled` tinyint(1) default '0',
      PRIMARY KEY  (`id`),
      UNIQUE KEY `md5` (`md5`),
    ) ENGINE=MyISAM  DEFAULT CHARSET=latin1;
    

    Lock and retrieve the next job (the site that was last scraped the longest time ago):

    Update site
      set `scraping` = '$worker_id' 
      where `scraping` is null 
      order by `last_scrape` ASC limit 1;
    
    $job = 
      Select * from site
      where `scraping` = '$worker_id'
    

    Release job back to queue:

    Update site
      set `scraping` = NULL,
      `last_scrape` = NOW()
      where `scraping` = '$worker_id';