Search code examples
pythonweb-crawlerwikiwikipedia

Finding shotest path between Wikipedia articles


I am writing a python web crawler to find a path between Wikipedia articles.

I have a start article and a goal article and I am trying to find a short path between them.

Right now I am basically just doing a breadth search from the start to the goal with some code like this.

 for link in to_crawl:
    links = get_all_links(source(link), crawled)
    if goal in links:
        return path+[link]+[goal]
    crawled.append(link)
    to_crawl.append(links)

It is getting from one article to another if they are only a few degrees away, but I need a way to keep track of the path I took.


Solution

  • So just keep track of it. Instead of having a list of links, have a list of link, path pairs. Something like this:

    to_crawl = [(start_page, [])]
    for link, path in to_crawl:
        links = get_all_links(source(link), crawled)
        if goal in links:
            return path+[link]+[goal]
        crawled.append(link)
        to_crawl.extend((new_link, path + [new_link]) for new_link in links)
    

    Also note that you had a serious problem with your existing code: to_crawl.append(links) appends a list of links as if it were a single link, when obviously you wanted to append each link in that list separately. I've fixed that by using extend.

    As a side note, path+[link]+[goal] is an odd thing to be returning. For example, if you went from page A to page D via path A-B-C-D, you're going to end up with B, C, D, C, D as your return value, which seems odd to say the least. If you need the last link and the goal separately from the path, why not just return path, link, goal instead of packing them onto the path?