python, python-2.7, web-scraping, scrapy

Scrapy extract data from dynamic table


I am trying to pull all the TD values from the "table-main" table on this page: http://www.oddsportal.com/basketball/usa/nba/results/

I am using Scrapy and Python 2.7.

From Scrapy Shell I can get the table via:

response.xpath('//*[@id="tournamentTable"]')

But I cannot seem to get any of the TR or TD elements of that table.

response.xpath('//*[@id="tournamentTable"]/tbody')

and response.xpath('//*[@id="tournamentTable"]/tbody/tr')

both return an empty list. I suspect the table is created dynamically. How can I scrape all the team names, scores, and odds from that table?

Note on possible duplicate

This question is not a duplicate of "Scrapy not finding table". That question is about locating the table; this one is about extracting the data inside it.


Solution

  • Yes, the table is dynamic: the results are loaded through an additional request to the website's API. In this case the request is made to http://fb.oddsportal.com/ajax-sport-country-tournament-archive/3/MmbLsWh8/X0/1/-1/1/?_=1446338252826.

    I'm not sure you can hardcode this URL in your spider, because at least two parts of it, 3 and MmbLsWh8, come from a script tag on the main page (the sid and id values):

    <script type="text/javascript">
        //<![CDATA[
        var op = new OpHandler();if(!page)var page = new PageTournament({"id":"MmbLsWh8","sid":3,"cid":200,"archive":true});var menu_open = null;vJs();op.init();if(page && page.display)page.display();    var sigEndPage = true;
        try
        {
            if (sigEndJs)
            {
                globals.onPageReady();
            }
        } catch (e)
        {
        }
    
        //]]>
    </script>
    

    There is also a _ parameter that looks like a Unix timestamp in milliseconds.

    The call to this AJAX URL returns a JSONP response with the HTML of the NBA results inside. You need to extract that HTML from the response (with a regular expression, for instance), feed it to a Selector, and extract the results. Some sample code from the shell to get you started:

    $ scrapy shell http://www.oddsportal.com/basketball/usa/nba/results/
    In [1]: fetch("http://fb.oddsportal.com/ajax-sport-country-tournament-archive/3/MmbLsWh8/X0/1/-1/1/?_=1446338252826")
    In [2]: import re
    In [3]: pattern = re.compile(r'"html":"(.*?)"}', re.MULTILINE | re.DOTALL)
    In [4]: import scrapy
    In [5]: selector = scrapy.Selector(text=pattern.search(response.body).group(1))
    In [6]: # TODO: now use the selector to extract the desired data
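
To make the steps above a bit more concrete, here is a rough sketch that uses only the standard library, so it can be tried outside of Scrapy. It reads the id and sid values from the inline script, rebuilds the AJAX URL with a fresh timestamp, and pulls the "html" value out of the JSONP body. Note the assumptions: the function names are made up for illustration, SAMPLE_JSONP is a hypothetical stand-in for the real response (I have not reproduced the actual wrapper or JSON layout), and the X0/1/-1/1 part of the URL is copied verbatim from the request above because its meaning is not known. The "html" value is decoded with json.loads rather than used raw, since a JSON string carries escapes such as \" and \/ that would otherwise end up in the markup.

```python
import json
import re
import time

# Inline script fragment copied from the results page shown above.
PAGE_SCRIPT = ('var op = new OpHandler();if(!page)var page = new PageTournament('
               '{"id":"MmbLsWh8","sid":3,"cid":200,"archive":true});var menu_open = null;')

# Hypothetical JSONP body: the wrapper name and JSON layout are assumptions.
SAMPLE_JSONP = (r'callback({"d":{"html":"<table><tr><td class=\"name\">'
                r'Team A - Team B<\/td><\/tr><\/table>"}});')

# Matches a complete JSON string literal (quotes included) after "html":
HTML_PATTERN = re.compile(r'"html":("(?:[^"\\]|\\.)*")', re.DOTALL)


def extract_page_params(page_source):
    """Return the dict passed to PageTournament(...) in the page source."""
    match = re.search(r'new PageTournament\((\{.*?\})\)', page_source)
    if match is None:
        raise ValueError("PageTournament parameters not found")
    return json.loads(match.group(1))


def build_archive_url(params, timestamp_ms=None):
    """Rebuild the AJAX URL from sid and id; '_' defaults to the current time in ms."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # The 'X0/1/-1/1' segments are kept exactly as observed in the request.
    return ("http://fb.oddsportal.com/ajax-sport-country-tournament-archive/"
            "{sid}/{id}/X0/1/-1/1/?_={ts}").format(
                sid=params["sid"], id=params["id"], ts=timestamp_ms)


def extract_html(jsonp_body):
    """Return the unescaped HTML stored under the "html" key of the JSONP body."""
    match = HTML_PATTERN.search(jsonp_body)
    if match is None:
        raise ValueError('"html" key not found in response')
    # json.loads turns the quoted literal into plain text, undoing \" and \/.
    return json.loads(match.group(1))
```

In a spider you would presumably call extract_page_params on the body of the results page, request the URL from build_archive_url, and in that callback pass extract_html(...) to scrapy.Selector(text=...) as in the shell session above. From there the usual XPath queries for tr and td should work on the returned table, though the exact row layout has to be checked against the real response.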