We have tens of thousands of blogs we want to check multiple times a day for new posts. I'd love some ideas, with example code, on the most efficient way to do this using Perl.
Currently we are just using LWP::UserAgent to download each RSS feed and then checking each URL in the resulting feed, one at a time, against a MySQL database table of already-found URLs. Needless to say, this doesn't scale well and is super inefficient.
Thanks in advance for your help & advice!
Unfortunately, there is probably no way around doing some kind of polling.
Luckily, implementing the PubSubHubbub protocol can greatly reduce the amount of polling for the feeds that support it.
For feeds that don't support PubSubHubbub, make sure you use HTTP-level mechanisms (like ETag or If-Modified-Since headers) to know if/when a resource has been updated.
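As a rough illustration of the conditional-request idea, here is a minimal sketch. It assumes `$ua` is an LWP::UserAgent (e.g. `LWP::UserAgent->new(timeout => 20)`) and that you keep a small per-feed record with the last `ETag` and `Last-Modified` values you saw. The `$cache` hashref and the sub names are made up for this example; in practice those two values would live in a column next to the feed URL.

```perl
use strict;
use warnings;

# Build conditional request headers from what was stored on the last fetch.
# $cache is a hypothetical hashref: { etag => ..., last_modified => ... }.
sub conditional_headers {
    my ($cache) = @_;
    my @headers;
    push @headers, 'If-None-Match' => $cache->{etag}
        if defined $cache->{etag};
    push @headers, 'If-Modified-Since' => $cache->{last_modified}
        if defined $cache->{last_modified};
    return @headers;
}

# Fetch a feed only if it changed since the last poll.
# $ua is assumed to be an LWP::UserAgent (anything with a compatible
# get() method works). Returns the feed body, or nothing when the server
# answers 304 Not Modified or the request fails.
sub fetch_if_changed {
    my ($ua, $url, $cache) = @_;
    my $res = $ua->get($url, conditional_headers($cache));

    return if $res->code == 304;       # unchanged: nothing to download or parse
    return unless $res->is_success;    # error: keep the old validators

    # Remember the validators so the next poll can send them back.
    $cache->{etag}          = $res->header('ETag');
    $cache->{last_modified} = $res->header('Last-Modified');
    return $res->decoded_content;
}
```

When the server honours these headers, an unchanged feed costs one small 304 response and no parsing or database lookups at all. Servers that ignore them simply return a normal 200, so nothing breaks.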
Also make sure you implement some kind of back-off mechanism, so that feeds which rarely change are polled less often.
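One possible back-off policy, sketched below, is to double a feed's polling interval every time a poll finds nothing new (or fails) and to reset it when new content shows up. The bounds (15 minutes and one day) and the name `next_interval` are arbitrary choices for this example, not anything prescribed; tune them to your own load.

```perl
use strict;
use warnings;

# Hypothetical bounds, in seconds: never poll more often than every
# 15 minutes, never less often than once a day.
use constant MIN_INTERVAL => 15 * 60;
use constant MAX_INTERVAL => 24 * 60 * 60;

# Return the next polling interval (in seconds) for a feed.
# $current is the interval used for the last poll (undef for a new feed);
# $changed is true when the last poll found new content.
sub next_interval {
    my ($current, $changed) = @_;

    # Active feed: go back to the shortest interval.
    return MIN_INTERVAL if $changed;

    # Quiet or failing feed: double the wait, capped at the maximum.
    my $next = ($current || MIN_INTERVAL) * 2;
    return $next > MAX_INTERVAL ? MAX_INTERVAL : $next;
}
```

You could store `time + next_interval(...)` with each feed as its next due time and have the poller select only the feeds that are due, so busy blogs get checked often and dormant ones drift toward once a day.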