Tools for crawling popular forum/bulletin board software

I've started writing a crawler for vBulletin boards. However, I am not a web programmer (I can work with JSON APIs, but that isn't really web crawling), so I don't know the best way to crawl or what tools are available.

I am more than capable of writing the crawler, but I find the underlying HTML very irregular, and I don't want to be a victim of the HTML structure changing in a newer version of vBulletin.

I'm writing an interface using pycurl and Beautiful Soup. Is there a better way to do this? Are there any good crawlers already available for vBulletin? (Language is not a concern.) A meta forum crawler (one that works with more than one forum type) would be even better.

If you cannot suggest one but have the experience, could you advise me on what to expect from the stability of the underlying HTML? Should I worry about a new version of vBulletin breaking my crawler?

Perhaps there is a better way to extract a vBulletin dataset?

Solution

Changing HTML is an inherent issue with web crawling, which is why crawling should only be an absolute last resort. Maintaining crawlers can be a huge task, as you have seen, because the HTML can change daily and there are no guarantees.

Because the data being searched for is usually uniform, Scrapy is an excellent choice: http://doc.scrapy.org/en/0.14/index.html

It uses XPath to select elements, which is relatively easy to maintain imo.
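
To illustrate why XPath selectors can be easy to maintain, here is a minimal sketch that keeps every selector in one table, so a template change means editing a few strings rather than the extraction logic. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup, so a real crawler would likely use Scrapy's selectors or another HTML-tolerant parser instead. The markup, the class names, and the names SAMPLE_PAGE, SELECTORS, and extract_posts are all hypothetical and are not taken from any real vBulletin template.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup. The ids and class names are
# invented for illustration; real vBulletin templates will differ.
SAMPLE_PAGE = """
<html><body>
  <div class="post" id="post_1">
    <a class="username">alice</a>
    <div class="content">First post</div>
  </div>
  <div class="post" id="post_2">
    <a class="username">bob</a>
    <div class="content">A reply</div>
  </div>
</body></html>
"""

# All selectors live in one table, so a markup change only requires
# editing these strings.
SELECTORS = {
    "post": ".//div[@class='post']",
    "author": "./a[@class='username']",
    "body": "./div[@class='content']",
}


def extract_posts(page_source):
    """Return a list of dicts with the id, author, and body of each post."""
    root = ET.fromstring(page_source.strip())
    posts = []
    for node in root.findall(SELECTORS["post"]):
        author = node.find(SELECTORS["author"])
        body = node.find(SELECTORS["body"])
        if author is None or body is None:
            # Fail loudly so a changed template is noticed immediately
            # instead of silently producing incomplete data.
            raise ValueError("selector mismatch in post %r" % node.get("id"))
        posts.append({
            "id": node.get("id"),
            "author": (author.text or "").strip(),
            "body": (body.text or "").strip(),
        })
    return posts


if __name__ == "__main__":
    for post in extract_posts(SAMPLE_PAGE):
        print(post["id"], post["author"], post["body"])
```

Raising an error when a selector stops matching is a deliberate choice here: it turns a template change into an obvious failure rather than a quietly shrinking dataset.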

Even if there is a vBulletin-specific scraper, it is still dependent on the HTML, which can break at any time. That said, because vBulletin is a platform, you are probably fairly well off scraping it. I would expect the HTML to change only on version updates, which shouldn't happen that often.

Does the mobile API provide the functionality you need? See https://www.vbulletin.com/forum/content.php/367-API-Overview. I guess this depends on how each site's vBulletin is set up.