Reactive web crawler with limited concurrent requests to the same domain


I'm working on an open-source web crawling project. I noticed that the application occasionally floods the websites it's crawling with requests (I get back 429 Too Many Requests). Because of this, I want to limit the number of concurrent requests to the same domain to one, with a one-second delay between requests.

I came up with this code to do that:

Flux.generate(downloaderQueueConsumer)
    .doFirst(this::initializeProcessing)
    .flatMap(this::evaluateDocumentLocation)
    .groupBy(this::parseDocumentDomain, 100000)
    .flatMap(documentSourceItem1 -> documentSourceItem1
            .delayElements(Duration.ofSeconds(1))
            .doOnNext(this::incrementProcessedCount)
            .flatMap(this::downloadDocument)
            .flatMap(this::archiveDocument)
            .doOnNext(this::incrementArchivedCount)
    )
    .doFinally(this::finishProcessing)
    .subscribe();

My problem with this code is that it doesn't limit the number of parallel requests per domain to one. Is there a way to achieve that?


Solution

  • If you want to do it this way, you'd probably need to maintain some state outside the Flux, such as a per-domain record of in-flight requests. There's no obvious way to store and update that kind of mutable data within the Flux itself.

    That said, this isn't the approach I'd recommend for rate limiting. I've instead done something like the following, which I find cleaner and more robust:

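    • Map a 429 status code to a "rate limit" exception (you'll likely need to define this exception type yourself).
    • Pull in reactor-extra, then use its Retry support to retry with exponential backoff and jitter (or whatever backoff strategy you prefer). In newer Reactor releases (3.3.4 and later, as far as I know), reactor-core ships its own reactor.util.retry.Retry for use with retryWhen, so the extra dependency may not be needed.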

    This gives you more control over your specific retry strategy and will likely make your code more readable.
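
To make the two steps concrete, here is a minimal plain-Java sketch of the logic involved: mapping a 429 to an exception, then retrying with exponential backoff and jitter. It is not the Reactor Retry API. It uses a blocking Thread.sleep only to keep the example dependency-free, and in a reactive pipeline you would hand this job to a retry operator instead. The names RateLimitException, BackoffSketch, checkStatus, backoffCap, jitteredDelay, and callWithRetry are hypothetical, and the jitter variant shown (a uniformly random delay between zero and the exponential cap) is just one of several reasonable choices.

```java
import java.time.Duration;
import java.util.Random;
import java.util.concurrent.Callable;

// Hypothetical exception type for a 429 Too Many Requests response.
class RateLimitException extends RuntimeException {
    RateLimitException(String message) {
        super(message);
    }
}

final class BackoffSketch {

    // Step 1: map a 429 status code to the rate limit exception.
    static void checkStatus(int statusCode) {
        if (statusCode == 429) {
            throw new RateLimitException("429 Too Many Requests");
        }
    }

    // Exponential cap for a 0-based attempt: min(max, base * 2^attempt).
    static Duration backoffCap(int attempt, Duration base, Duration max) {
        long maxMs = max.toMillis();
        long delayMs = Math.min(base.toMillis(), maxMs);
        for (int i = 0; i < attempt; i++) {
            if (delayMs > maxMs / 2) {
                return Duration.ofMillis(maxMs); // doubling would exceed the cap
            }
            delayMs *= 2;
        }
        return Duration.ofMillis(delayMs);
    }

    // Jitter: pick a uniformly random delay in [0, cap) so that many
    // clients hitting the same limit do not all retry at the same moment.
    static Duration jitteredDelay(int attempt, Duration base, Duration max, Random random) {
        long capMs = backoffCap(attempt, base, max).toMillis();
        return Duration.ofMillis((long) (random.nextDouble() * capMs));
    }

    // Step 2: retry only on RateLimitException, up to maxRetries times.
    // Blocking sleep is used here purely for illustration.
    static <T> T callWithRetry(Callable<T> request, int maxRetries,
                               Duration base, Duration max, Random random) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return request.call();
            } catch (RateLimitException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted, surface the failure
                }
                Thread.sleep(jitteredDelay(attempt, base, max, random).toMillis());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Random random = new Random();
        int[] calls = {0};
        // Simulated request: answers 429 twice, then 200.
        String body = callWithRetry(() -> {
            checkStatus(calls[0]++ < 2 ? 429 : 200);
            return "document body";
        }, 5, Duration.ofMillis(100), Duration.ofSeconds(5), random);
        System.out.println(body + " after " + calls[0] + " calls");
    }
}
```

Capping the delay keeps a long run of 429 responses from pushing the wait time out indefinitely, and limiting the retry to RateLimitException means other failures still surface immediately.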