I have a huge database table (6.4 million rows) with three columns:

URL | Title | Description

Right now only the URL column is populated, with various URLs from around the web. I have to add a title and meta description for the URL in each row. I see three ways to do this:
Option 1: Use PHP and cURL to visit each URL and grab the title and meta description. (Even with option 2 or 3, I'll have to do this for at least some of the URLs.)
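For what it's worth, a minimal sketch of what that per-URL fetch could look like (the function name, timeouts, and error handling are my own choices, not anything prescribed):

<?php
// Fetch one page and pull out its <title> and meta description.
function fetchTitleAndDescription(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body rather than printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false) {
        return [null, null];             // network error: leave the row untouched
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html);              // suppress warnings from messy real-world HTML

    $title = null;
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $title = trim($titles->item(0)->textContent);
    }

    $description = null;
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $description = trim($meta->getAttribute('content'));
            break;
        }
    }
    return [$title, $description];
}

// usage:
[$title, $desc] = fetchTitleAndDescription('http://www.toonhound.com/');

Bear the scale in mind, though: even at an optimistic one request per second, 6.4 million sequential fetches is roughly 74 days, so this route needs many parallel workers and will still hit plenty of dead links.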
Option 2: I have a DMOZ XML file containing several million elements that look like this:
<ExternalPage about="http://animation.about.com/">
<d:Title>About.com: Animation Guide</d:Title>
<d:Description>Keep up with developments in online animation for all skill levels. Download tools, and seek inspiration from online work.</d:Description>
<topic>Top/Arts/Animation</topic>
</ExternalPage>
<ExternalPage about="http://www.toonhound.com/">
<d:Title>Toonhound</d:Title>
<d:Description>British cartoon, animation and comic strip creations - links, reviews and news from the UK.</d:Description>
<topic>Top/Arts/Animation</topic>
</ExternalPage>
etc.
I can use XMLReader to compare the URL in each ExternalPage's about attribute against the 6.4 million URLs in my table and, where there's a match, add the title and description.
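A minimal sketch of that streaming approach, assuming the main table is called urls (columns url, title, description) and the dump file is content.rdf.u8 (all names and connection details are illustrative):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$update = $pdo->prepare('UPDATE urls SET title = ?, description = ? WHERE url = ?');

$reader = new XMLReader();
$reader->open('content.rdf.u8');

$url = $title = null;
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }
    // Each ExternalPage is followed by its d:Title and d:Description
    // children, as in the sample above.
    switch ($reader->localName) {
        case 'ExternalPage':
            $url = $reader->getAttribute('about');
            break;
        case 'Title':
            $title = $reader->readString();
            break;
        case 'Description':
            // Description is the last field we need, so fire the update;
            // URLs that are not in the table simply update zero rows.
            $update->execute([$title, $reader->readString(), $url]);
            break;
    }
}
$reader->close();

For this to be anywhere near fast, the urls.url column must be indexed; otherwise every one of those several million UPDATEs is a full table scan.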
Option 3: I can write all of the above DMOZ data into a separate database table (again using XMLReader), which will also take time, and then use JOINs and SELECTs on the two tables to pull out the information I need.
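The staging table could be as simple as this (MySQL assumed; table and column names are illustrative):

CREATE TABLE dmoz_data (
    url         VARCHAR(700) NOT NULL,
    title       VARCHAR(255),
    description TEXT,
    topic       VARCHAR(255),
    -- index a prefix of the URL so the key stays within InnoDB limits;
    -- equality JOINs can still use it
    KEY idx_url (url(191))
) ENGINE=InnoDB;

If you write the parsed XML out to a CSV file first, MySQL's LOAD DATA INFILE will load millions of rows far faster than row-by-row INSERTs.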
Would option 2 or 3 be quicker for those URLs than just using option 1 for everything? And if so, which of the two would be faster?
Assuming "Faster" means less time to update your URL table, then option 3 seems to be the best option.
You can populate another table with records from XML and then using JOIN to update your main table.
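Something along these lines, assuming the main table is urls and the staging table is dmoz_data as sketched above (MySQL syntax):

UPDATE urls u
JOIN dmoz_data d ON d.url = u.url
SET u.title       = d.title,
    u.description = d.description;

With both url columns indexed this is a single set-based operation instead of millions of separate round trips, which is why it should beat option 2. It also leaves the network out of it entirely: you only fall back to option 1 for the URLs the DMOZ data doesn't cover.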