How to use Xapian to index web pages so that a search returns their URLs

I am using Ubuntu 12.04, Python 2.7

My code for getting the contents from a given URL:

import urllib

def get_page(url):
    '''Gets the contents of a page from a given URL'''
    try:
        f = urllib.urlopen(url)
        page = f.read()
        f.close()
        return page
    except IOError:
        # Return an empty page if the URL cannot be fetched
        return ""

To filter the content of a page provided by get_page(url):

import re

def filterContents(content):
    '''Filters the content from a page'''
    filteredContent = ''
    regex = re.compile('(?<!script)[>](?![\s\#\'-<]).+?[<]')
    for words in regex.findall(content):
        word_list = split_string(words, """ ,"!-.()<>[]{};:?!-=/_`&""")
        for word in word_list:
            filteredContent = filteredContent + word
    return filteredContent

def split_string(source, splitlist):
    return ''.join([ w if w not in splitlist else ' ' for w in source])

How do I index filteredContent in Xapian so that a query returns the URLs of the pages it matched?


  • I'm not completely clear what your filterContents() and split_string() are actually trying to do (throwing away some HTML tag contents and then word splitting), so let me talk through a similar problem that doesn't have that complexity folded into it.

    Let's assume we have your get_page() function, plus a function strip_tags() which returns just the textual content of an HTML document. We want to build up a Xapian database where:

    • each document represents the resource representation pulled from a particular URL
    • the "words" in that representation (having been passed through strip_tags()) become searchable terms that index those documents
    • each document contains as its document data the URL it was all pulled from.
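
    The strip_tags() helper assumed above is not defined in this answer. As a rough sketch only, it could be built on the standard library's HTML parser; the class name _TextExtractor and the choice to skip script and style contents are my assumptions, not something from the question:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7

class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []      # text fragments seen so far

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace
        return ' '.join(' '.join(self._chunks).split())

def strip_tags(html):
    """Return the visible text of an HTML string (assumed behaviour)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

    Real-world pages are messier than this handles (character encodings, malformed markup), so treat it as a starting point rather than a finished extractor.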

    So you could index as follows:

    import xapian

    def index_url(database, url):
        text = strip_tags(get_page(url))
        doc = xapian.Document()
        # TermGenerator will split text into words
        # and then (because we set a stemmer) stem them
        # into terms and add them to the document
        termgenerator = xapian.TermGenerator()
        # (an English stemmer is assumed here)
        termgenerator.set_stemmer(xapian.Stem("en"))
        termgenerator.set_document(doc)
        termgenerator.index_text(text)
        # We want to be able to get at the URL easily
        doc.set_data(url)
        # And we want to ensure each URL only ends up in
        # the database once. Note that if your URLs are long
        # then this won't work; consult the Xapian FAQ on
        # unique IDs for more.
        idterm = 'Q' + url
        doc.add_boolean_term(idterm)
        database.replace_document(idterm, doc)

    # then index an example URL
    db = xapian.WritableDatabase("exampledb", xapian.DB_CREATE_OR_OPEN)
    index_url(db, "")
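
    On the long-URL caveat in the comments above: Xapian limits how long a term can be, so 'Q' + url stops working for very long URLs. One possible workaround, sketched here under assumptions, is to fall back to a fixed-length hash of the URL. The names make_id_term and MAX_TERM_BYTES are hypothetical, and the 240-byte cap is a conservative guess on my part; check the actual limit for your backend.

```python
import hashlib

# Assumed conservative cap on term length in bytes; the real limit
# depends on the Xapian backend, so check it for your setup.
MAX_TERM_BYTES = 240

def make_id_term(url, prefix='Q'):
    """Return a unique ID term for url that stays under MAX_TERM_BYTES."""
    term = prefix + url
    encoded = term.encode('utf-8')
    if len(encoded) <= MAX_TERM_BYTES:
        # Short enough: keep the readable prefix + url form
        return term
    # Too long: replace the URL with a fixed-length SHA-1 hex digest
    return prefix + hashlib.sha1(url.encode('utf-8')).hexdigest()
```

    You would then write idterm = make_id_term(url) in place of idterm = 'Q' + url. The document data still holds the full URL, so search results are unaffected.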

    Searching is then simple, although it can obviously get more sophisticated if needed:

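    qp = xapian.QueryParser()
    # Stem query terms the same way the indexed text was stemmed
    qp.set_stemmer(xapian.Stem("en"))
    qp.set_stemming_strategy(qp.STEM_SOME)
    query = qp.parse_query('question')
    # or, for a multi-word query:
    query = qp.parse_query('question and answer')
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    for match in enquire.get_mset(0, 10):
        print match.document.get_data()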

    which will display the URL that was indexed, since 'question and answer' is on that site's homepage when you aren't logged in.

    I'd recommend checking out the Xapian getting started guide both for concepts and code.