We work with a lot of real estate, and while rearchitecting how the data is imported, I came across an interesting issue.
Firstly, the way our system works (loosely speaking) is we run a Coldfusion process once a day that retrieves data provided from an IDX vendor via FTP. They push the data to us. Whatever they send us is what we get.
Over the years, this has proven to be rather unstable.
I am rearchitecting it with PHP on the RETS standard, which uses SOAP methods of retrieving data, which is already proven to be much better than what we had.
When it comes to 'updating' existing data, my initial thought was to query only for data that was updated. There is a field for 'Modified' that tells you when a listing was last updated, and the code I have will grab any listing updated within the last 6 hours (give myself a window in case something goes wrong).
However, I see a lot of real estate developers suggest creating 'batch' processes that run through all listings regardless of updated status that is constantly running.
Is this the better way to do it? Or am I fine with just grabbing the data I know I need? It doesn't make a lot of sense to me to do more processing than necessary. Thoughts?
If you can trust the MODIFIED, you approach is good enough and faster than parsing the whole file. If you are looping the whole find trying to find what is modified or not it will be harder.
There will a slight performance improvement if you can send the file directly to the Database and let it import with its built in file import functions. For example MySQL has this - http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html I am sure other DB like SQL Server and Oracle have such mechanisms too. This will save you the overhead of the programming language work and only use the Database resources.