Search code examples
phpmysqlcurlxmlreader

Which of these three options would be fastest to populate my DB table?


I have a (huge: 6.4 million rows) database table with three columns as follows:

URL     Title      Description

Right now only the URL column is populated, with various urls from around the web. I have to add a title and meta description to the url in each row. There are three ways I see I can do this:

Option 1: Use php and CURL to go to each url and grab the title and meta:description. (Even with using options 2 or 3 I'll have to do this for at least part of the urls).

Option 2: I have a DMOZ xml file which has several million elements which look like this:

    <ExternalPage about="http://animation.about.com/">
       <d:Title>About.com: Animation Guide</d:Title>
       <d:Description>Keep up with developments in online animation for all skill levels.     Download tools, and seek inspiration from online work.</d:Description>
       <topic>Top/Arts/Animation</topic>
    </ExternalPage>
    <ExternalPage about="http://www.toonhound.com/">
       <d:Title>Toonhound</d:Title>
       <d:Description>British cartoon, animation and comic strip creations - links, reviews  and news from the UK.</d:Description>
       <topic>Top/Arts/Animation</topic>
    </ExternalPage>

etc.

I can use xmlreader to compare the url in every ExternalPage-about to see if it matches one of the 6.4 million urls in my table, and if it does, add the title and description.

Option 3: I can write all the above dmoz data into a separate database table (again using xmlreader), which will also take time, and then somehow use JOINs and SELECTs on the two tables to get the information I need.

Would option 2 or 3 be quicker for those urls than just using option 1 for all the urls? If yes, which of the two options would be faster?


Solution

  • Assuming "Faster" means less time to update your URL table, then option 3 seems to be the best option.

    You can populate another table with records from XML and then using JOIN to update your main table.