python, web-crawler

How to write a Python robot that browses


Possible Duplicate:
Where shall I start in making a scraper or a bot using python?

I know it's obviously possible... I was asked to implement some sort of robot that visits a website, logs in, visits a set of links, fills a search form with date inputs to get an XLS file, and logs off. Done manually, this whole ordeal takes almost an hour, so a script/robot would save us a lot of time.

Ideas? Libraries? I guess I'm going to need urllib?
Or maybe not use Python at all?
Thanks in advance!

Edit: I searched quite a bit for "python crawler" and didn't come upon Mechanize or Scrapy until right before the comments :/
I'll look further into Mechanize first. Thanks.
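
For the record, here is roughly the flow I'm hoping to automate, sketched with mechanize (every URL and form field name below is made up; it's only meant to show the shape of the thing):

    import mechanize

    br = mechanize.Browser()

    # Log in: URL and field names are placeholders for the real login form.
    br.open('http://example.com/login')
    br.select_form(nr=0)              # first form on the page
    br['username'] = 'me'
    br['password'] = 'secret'
    br.submit()

    # Fill the search form with a date range and grab the XLS export.
    br.open('http://example.com/search')
    br.select_form(nr=0)
    br['date_from'] = '2010-01-01'
    br['date_to'] = '2010-01-31'
    report = br.submit()
    with open('report.xls', 'wb') as f:
        f.write(report.read())

    # Log off.
    br.open('http://example.com/logout')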


Solution

  • I am a fan of the twill Python module. Here is a small sample of code I used with it not too long ago to do basic browsing and scraping.

    import re
    import sys
    
    import twill
    import twill.commands as c
    
    def login():
        # Start a fresh session, then fill in and submit the login form.
        c.clear_cookies()
        c.go('http://icfpcontest.org/icfp10/login')
        c.fv(1, 'j_username', 'Side Effects May Include...')
        c.fv(1, 'j_password', '<redacted>')
        c.submit()
        c.save_cookies('/tmp/icfp.cookie')
    
    # Matches the two numeric cells of each row in the score table.
    all_cars_rx = re.compile(r'<td style="width: 20%;">(\d+)</td><td>(\d+)</td>')
    
    def list_cars():
        c.go('http://icfpcontest.org/icfp10/score/instanceTeamCount')
        cars = re.findall(all_cars_rx, c.show())
        if not cars:
            sys.stderr.write(c.show())
            sys.stderr.write('Could not find any cars')
        return cars
    

    It is worth mentioning that one should not use a regular expression to parse HTML. What you have here is a dirty hack that was done for the ICFP contest on a very short timetable.
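
    If you want something less brittle than that regex, an HTML parser does the same job. Here is a rough sketch using BeautifulSoup; it assumes each row of the score table is a <tr> whose first two <td> cells hold the two numbers the regex was capturing:

    from bs4 import BeautifulSoup

    def list_cars_from_html(html):
        # Pull the two numeric cells out of each table row
        # without matching a regex against raw HTML.
        soup = BeautifulSoup(html, 'html.parser')
        cars = []
        for row in soup.find_all('tr'):
            cells = row.find_all('td')
            if len(cells) >= 2:
                cars.append((cells[0].get_text(strip=True),
                             cells[1].get_text(strip=True)))
        return cars

    Feeding it the same c.show() output should give the same pairs, without depending on the exact inline style attribute.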