I have two Akka actors used for crawling links, i.e. find all links in page X, then find all links in all pages linked from X, and so on.
I want them to progress at more or less the same pace, but more often than not one of them is starved while the other consumes all the resources.
I've tried the following approaches (simplified). Crawling a single page is done by the following actor:
    class Crawler extends Actor {
      def receive = {
        case Crawl(url, kind) =>
          // download url
          // extract links
          sender ! Parsed(url, links, kind)
      }
    }
Approach 1:
    class Coordinator extends Actor {
      val linksA = ...
      val linksB = ...
      def receive = {
        case Parsed(url, links, kind) =>
          val store = if (kind == kindA) linksA else linksB
          val newLinks = links -- store
          store ++= links
          newLinks.foreach { link =>
            val crawler = context.actorOf(Props[Crawler])
            crawler ! Crawl(link, kind)
          }
      }
    }
Approach 2:
    class Coordinator extends Actor {
      val linksA = ...
      val linksB = ...
      // one round-robin pool of 10 crawlers per kind
      val rrProps = Props[Crawler].withRouter(RoundRobinRouter(nrOfInstances = 10))
      val crawlerA = context.actorOf(rrProps)
      val crawlerB = context.actorOf(rrProps)
      def receive = {
        case Parsed(url, links, kind) =>
          val store = if (kind == kindA) linksA else linksB
          val newLinks = links -- store
          store ++= links
          newLinks.foreach { link =>
            if (kind == kindA) crawlerA ! Crawl(link, kind)
            else crawlerB ! Crawl(link, kind)
          }
      }
    }
The second approach made things slightly better, but didn't fix the problem entirely.
Is there a good way to make crawlers of both kinds progress more or less at the same pace? Should I send messages between them unblocking each other in turn?
I'm working on a similar program where the workers have a non-uniform resource cost (in my case the task is performing database queries and dumping the results in another database, but just as crawling different websites will have different costs so too will different queries have different costs). Two ways of dealing with this that I've employed:
1. Replace the RoundRobinRouter with a SmallestMailboxRouter.
2. Don't let the Coordinator send out all of its messages at once. Instead, send them out in batches: in your case you have ten workers, so sending out forty messages should keep them busy initially. Whenever a worker completes a task it sends a message to the Coordinator, at which point the Coordinator sends out another message that will probably go to the worker that just completed its task. (You can also do this in batches, i.e. after receiving n "task complete" messages the Coordinator sends out another n messages, but don't make n too high or else some workers with extremely short tasks may be idle.)

A third option is to cheat and share a ConcurrentLinkedQueue between all actors: after filling the queue the Coordinator sends a "start" message to the workers, and the workers then poll the queue until it's empty.
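
The batched dispatch described above could be sketched roughly as follows. This is only an illustration of the bookkeeping, written in plain Scala so it runs without Akka: ThrottledDispatcher, maxInFlightPerKind, offer and completed are hypothetical names, not part of Akka or of the code in the question. Capping the in-flight count per kind is an assumption made here to keep both kinds of crawler moving; a single global cap would follow the same pattern.

```scala
import scala.collection.mutable

// Hypothetical helper (not an Akka class): holds back queued tasks so that
// at most `maxInFlightPerKind` tasks of each kind are outstanding at once.
final class ThrottledDispatcher[K, T](maxInFlightPerKind: Int) {
  private val pending  = mutable.Map.empty[K, mutable.Queue[T]]
  private val inFlight = mutable.Map.empty[K, Int].withDefaultValue(0)

  /** Queue new tasks; returns the ones that may be sent to workers right now. */
  def offer(kind: K, tasks: Iterable[T]): List[T] = {
    val queue = pending.getOrElseUpdate(kind, mutable.Queue.empty[T])
    queue ++= tasks
    release(kind)
  }

  /** Record one finished task; returns the tasks that may be sent as a result. */
  def completed(kind: K): List[T] = {
    inFlight(kind) = math.max(0, inFlight(kind) - 1)
    release(kind)
  }

  def inFlightCount(kind: K): Int = inFlight(kind)

  def pendingCount(kind: K): Int = pending.get(kind).map(_.size).getOrElse(0)

  // Move tasks from the pending queue into flight until the cap is reached.
  private def release(kind: K): List[T] = {
    val queue = pending.getOrElseUpdate(kind, mutable.Queue.empty[T])
    val out   = List.newBuilder[T]
    while (inFlight(kind) < maxInFlightPerKind && queue.nonEmpty) {
      out += queue.dequeue()
      inFlight(kind) += 1
    }
    out.result()
  }
}
```

In the Coordinator, each Parsed message would count as one completed crawl: call completed(kind) and offer(kind, newLinks), then send a Crawl message for every link either call returns. With ten crawlers per kind, a cap of around forty matches the numbers suggested above. Note that a crawl which fails without replying never frees its slot, so a real version would also need a failure or timeout message.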