Search code examples
javascriptpythonajaxscreen-scrapingscrapy

Can scrapy be used to scrape dynamic content from websites that are using AJAX?


I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel.

Most of the issues are solvable and I'm having a good little mess around. However I'm hitting a massive hurdle over one issue. If a site loads a table of horses and lists current betting prices this information is not in any source file. The clue is that this data is live sometimes, with the numbers being updated obviously from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.

Now my experience with dynamic web content is low, so this thing is something I'm having trouble getting my head around.

I think Java or Javascript is a key, this pops up often.

The scraper is simply a odds comparison engine. Some sites have APIs but I need this for those that don't. I'm using the scrapy library with Python 2.7

I do apologize if this question is too open-ended. In short, my question is: how can scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting odds data in real-time?


See also: How can I scrape a page with dynamic content (created by JavaScript) in Python? for the general case.


Solution

  • Webkit based browsers (like Google Chrome or Safari) has built-in developer tools. In Chrome you can open it Menu->Tools->Developer Tools. The Network tab allows you to see all information about every request and response:

    enter image description here

    In the bottom of the picture you can see that I've filtered request down to XHR - these are requests made by javascript code.

    Tip: log is cleared every time you load a page, at the bottom of the picture, the black dot button will preserve log.

    After analyzing requests and responses you can simulate these requests from your web-crawler and extract valuable data. In many cases it will be easier to get your data than parsing HTML, because that data does not contain presentation logic and is formatted to be accessed by javascript code.

    Firefox has similar extension, it is called firebug. Some will argue that firebug is even more powerful but I like the simplicity of webkit.