Basically, I want my site to aggregate a lot of RSS feeds and store them in a database during a cron job. I use Magpie to parse the RSS into arrays. Everything seems straightforward, although I'm worried about duplicate entries when running the cron job.
What is the best solution to avoid duplicate entries? Here is my theory, although I don't think it's efficient.
Cron job theory:
1) Parse the RSS feed with Magpie.
2) Create an MD5 hash of the link.
3) Test for the existence of that MD5 in the database table: if it's not there, insert it; if it already exists, ignore it or update.
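Something along these lines (a minimal sketch, assuming MagpieRSS's fetch_rss() and a PDO/MySQL connection; the articles table and its column names are just placeholders):

```php
<?php
// Rough sketch of the cron job: MagpieRSS + PDO, check-then-insert on an MD5 of the link.
require_once 'magpierss/rss_fetch.inc';

$pdo = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');

$feeds = array('http://example.com/feed.rss'); // list of feed URLs

foreach ($feeds as $url) {
    $rss = fetch_rss($url);                     // 1) parse the feed with Magpie
    if (!$rss) {
        continue;                               // skip feeds that fail to parse
    }

    foreach ($rss->items as $item) {
        $hash = md5($item['link']);             // 2) MD5 hash of the link

        // 3) test for existence of the hash; insert only if it is new
        $check = $pdo->prepare('SELECT 1 FROM articles WHERE link_hash = ?');
        $check->execute(array($hash));

        if (!$check->fetchColumn()) {
            $insert = $pdo->prepare(
                'INSERT INTO articles (link_hash, link, title, description)
                 VALUES (?, ?, ?, ?)'
            );
            $insert->execute(array(
                $hash,
                $item['link'],
                $item['title'],
                isset($item['description']) ? $item['description'] : '',
            ));
        }
        // if it already exists, ignore it (or UPDATE the row instead)
    }
}
```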
Let me know if there is a more efficient way.
Links may not be enough, because articles are duplicated across several sites. I once built a system to collect articles from many newspapers, where the same article can appear in multiple sources. Also, a site may publish the same article at multiple URLs, for example when an article is listed in multiple categories.
If you really want to be sure an article is not a duplicate, compare the content itself, or a hash computed from it.
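For example, a rough sketch of what I mean by hashing the content; the normalization step is just one assumption about what should count as "the same" article:

```php
<?php
// Sketch of a content-based duplicate check: normalize the article body
// and hash that instead of (or in addition to) the link.
function content_fingerprint($html)
{
    $text = strip_tags($html);                       // drop markup
    $text = strtolower($text);                       // ignore case differences
    $text = preg_replace('/\s+/', ' ', trim($text)); // collapse whitespace
    return md5($text);
}

// Two copies of the same article on different URLs produce the same hash:
$hashA = content_fingerprint('<p>Breaking  News: something happened.</p>');
$hashB = content_fingerprint("Breaking News:\nsomething happened.");
var_dump($hashA === $hashB); // bool(true)
```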