Handling duplicate data when parsing huge XML feeds

I'm writing a component which parses an xml feed with stock quotes and saves the result in a database. The problem is fairly straightforward, except that feed can not be read incrementallly. That is, there is no way to specify that you only want the X last quote changes or only changes newer than X minutes, say. I know the real problem is that the feed is stupid and that the provider should fix their stuff, but that is not an option atm.

The feed is a huge xml file which contains the 100000 last stock quotes for the provider. The feed is polled once every minute during which there is about 50-100 changed quotes. The rest is duplicate quotes which are read again and again and again.

During each poll of the feed, I parse all quotes (using lxml) to objects. Then, for each quote object, I check if the quote already exist in the database. If it does, I discard it and if it doesn't, I save it. This procedure is extremely wasteful since only about 0.1% is new data, there rest is duplicates. To optimize it a bit, I create a lookup table by querying the database once for quotes updated in the last X hours. The quotes are unique in the database on the (last_update, stock_id) key so this optimization reduces the number of queries by about 50%.

But there is still 50k db queries where each quote have to be checked individually if it exists or not which is very taxing on the database.

So what I'm looking for is ideas on how to make my feed parser faster. Maybe there is a way to diff the last fetched xml file with the new one?

Solution

Are the most recent items at the top or the bottom of the feed? If they are at the top then you could stop parsing when you have seen the first item which is already present in the database.

If the most recent items come last you could cache the quote keys and just look them up in memory and start hitting the database once you come to a non cached one. Or you could remember the last quote you put in the database and when parsing all the items you look for it and only hit the database for items after it.