Tags: php · mysql · rss · aggregation · simplepie

Best Practice for Skipping Duplicate Entries in MySQL


I have written a feed aggregator before, but I'm trying to optimize it a bit. In the past, using SimplePie (a PHP class) to parse the feeds, I have called the get_id() function on each feed item to get a hash (an md5 of link + title). I store this "id" as remote_id in MySQL. However, to ensure there are no duplicates, I've been running a SELECT query for every feed item to check that the remote_id does not already exist. This seems inefficient considering I'm processing thousands of feeds.

Is it more efficient to simply make remote_id a unique key and let the database reject the duplicate insert on each pass? Is there a better way to engineer this?


Solution

  • Yes, if a column's values should be unique, it's generally a good idea to define a UNIQUE key on it in MySQL.
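
    As a minimal sketch, the constraint could be declared like this (the table name `feed_items` and the extra columns are hypothetical; `remote_id` is the md5 hash from the question):

        -- Declare remote_id as a UNIQUE key so MySQL rejects duplicates
        ALTER TABLE feed_items ADD UNIQUE KEY uq_remote_id (remote_id);

        -- With the key in place, INSERT IGNORE silently skips rows that
        -- would violate it, instead of raising an error:
        INSERT IGNORE INTO feed_items (remote_id, title, link)
        VALUES (?, ?, ?);

    Note that `remote_id` is a 32-character md5 string, so a CHAR(32) column keeps the index compact.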

    When inserting possible duplicates, you can use PDO with try {} catch () {} blocks to filter them out: with the error mode set to PDO::ERRMODE_EXCEPTION, a duplicate insert throws a PDOException, so you don't have to check beforehand.

    I use something like this in a similar situation (pseudocode alert — the 'INSERT QUERY' placeholder and the $rows loop are sketched, not literal):

            $stmt = $this->dbh->prepare('INSERT QUERY');
            // Make PDO throw exceptions instead of failing silently
            $this->dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    
            foreach ($rows as $cells) {
                try {
                    $this->dbh->beginTransaction();
    
                    $stmt->execute($cells);
    
                    $this->dbh->commit();
                } catch (Exception $e) {
                    // A duplicate-key violation (SQLSTATE 23000) lands here;
                    // roll back, record the row, and move on to the next one
                    $this->dbh->rollback();
                    $this->invalidRows[] = array($cells, $e->getMessage());
                    continue;
                }
            }