How to extract information from ODP accurately?

I am building a search engine in Python.

I have heard that Google falls back to the description from the ODP (Open Directory Project) when it can't work out a description from the page's metadata. I want to do something similar.

ODP is an online directory from Mozilla that holds descriptions of pages on the web, so I want to fetch the descriptions for my search results from it. How do I get the accurate description of a particular URL from ODP, and return Python's None if it can't be found (meaning ODP has no entry for the page I am looking for)?

PS. There is a URL, http://dmoz.org/search?q=Your+Search+Params, but I don't know how to extract information from it.

Solution

To use ODP data, you'd download the RDF data dump. RDF is an XML format; you'd index that dump to map URLs to descriptions. I'd use a SQL database for this.

Note that a URL can appear in more than one place in the dump. Stack Overflow is listed twice, for example; Google uses the text from one entry as the site description, while Bing uses the other.

The data dump is of course rather large. Use sensible tools such as the ElementTree iterparse() method to parse the data set iteratively as you add entries to your database. You really only need to look for the <ExternalPage> elements, taking the <d:Title> and <d:Description> entries underneath.
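As a rough illustration of that iterative approach, here is a minimal sketch using only the standard library's ElementTree. The SAMPLE document is made up for the example (it only imitates the shape of the dump), and iter_pages is a hypothetical helper name, not part of any library.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample imitating the dump's shape; the real file is far larger
SAMPLE = b'''<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/">
    <d:Title>Example</d:Title>
    <d:Description>An example page.</d:Description>
  </ExternalPage>
  <ExternalPage about="http://example.org/">
    <d:Title>No description here</d:Title>
  </ExternalPage>
</RDF>'''

# Namespace prefixes in ElementTree's {uri}name notation
ODP = '{http://dmoz.org/rdf/}'
DC = '{http://purl.org/dc/elements/1.0/}'


def iter_pages(source):
    """Yield (url, title, description) for each ExternalPage element."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag != ODP + 'ExternalPage':
            continue
        url = elem.get('about')
        # findtext() falls back to '' when the child element is missing
        title = elem.findtext(DC + 'Title', default='')
        description = elem.findtext(DC + 'Description', default='')
        # Drop the element's children and text now that we have read them;
        # the emptied element itself stays attached to the root
        elem.clear()
        yield url, title, description


pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page in pages:
    print(page)
```

Because iter_pages is a generator, only one entry is handed to the caller at a time, so the same loop can feed a database insert instead of a list.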

Using lxml (a faster and more complete ElementTree implementation) that'd look like:

from lxml import etree as ET
import gzip
import sqlite3

conn = sqlite3.connect('/path/to/database')

# create table
with conn:
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS odp_urls 
        (url text primary key, title text, description text)''')

count = 0
nsmap = {'d': 'http://purl.org/dc/elements/1.0/'}
with gzip.open('content.rdf.u8.gz', 'rb') as content, conn:
    cursor = conn.cursor()
    for event, element in ET.iterparse(content, tag='{http://dmoz.org/rdf/}ExternalPage'):
        url = element.attrib['about']
        title = element.xpath('d:Title/text()', namespaces=nsmap)
        description = element.xpath('d:Description/text()', namespaces=nsmap)
        # xpath() returns a list; take the first match or fall back to ''
        title = title[0] if title else ''
        description = description[0] if description else ''

        # no longer need this, remove from memory again, as well as any preceding siblings
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]

        cursor.execute('INSERT OR REPLACE INTO odp_urls VALUES (?, ?, ?)',
            (url, title, description))
        count += 1
        if count % 1000 == 0:
            print('Processed {} items'.format(count))
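
Once the table is filled, the lookup the question asks for (a description, or None when ODP has nothing) could be sketched as below. lookup_description is a hypothetical helper name, and the in-memory database with its single made-up row only stands in for the real odp_urls table built above. The sketch assumes an exact match on the URL as stored, so a variant such as a missing trailing slash would not be found.

```python
import sqlite3


def lookup_description(conn, url):
    """Return the stored ODP description for url, or None if there is none."""
    row = conn.execute(
        'SELECT description FROM odp_urls WHERE url = ?', (url,)
    ).fetchone()
    if row is None:
        # No row at all: the URL is not in the index
        return None
    # The indexer stores '' when an entry had no description; treat that
    # as "not found" too, so callers only ever see text or None
    return row[0] or None


# Throwaway in-memory database with one made-up row, for demonstration only
demo = sqlite3.connect(':memory:')
demo.execute(
    'CREATE TABLE odp_urls (url text primary key, title text, description text)')
demo.execute(
    'INSERT INTO odp_urls VALUES (?, ?, ?)',
    ('http://example.com/', 'Example', 'An example page.'))

print(lookup_description(demo, 'http://example.com/'))
print(lookup_description(demo, 'http://example.org/'))
```

Since url is the primary key, SQLite answers this query from the key's index rather than by scanning the table.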