javascript, html, canvas, web-crawler, data-extraction

Is it possible to extract data from these websites that don't output data in the HTML source code?


Many years ago I used Perl and Python to crawl websites by reading data out of the HTML source code.

Now I would like to do another personal project that involves extracting numerical data from:

  1. Table elements on this PredictIt Website

  2. Individual graph elements (x and y for each) on this PredictWise Website

  3. Individual graph elements (x and y for each) on this Five Thirty Eight Website

None of these web pages' HTML source code contains the numerical data. Is there a way to extract these data? If so, where can I find them?

I feel like there must be a way, because this is all front-end information that the browser needs to render the charts and graphs.

(I can't find raw data provided to developers on these webpages, so I guess I have to extract the data myself.)


Solution

  • The table elements on the first link are indeed readable from the rendered HTML. If you are using Chrome, right-click the text and choose "Inspect". The Chrome debugger will show you the exact HTML element that contains the data.

    The other links are more difficult. I don't see a way to view the data in the raw HTML, but on the second link I can see the JSON that the server sends to populate the graphs. You may be able to parse that for your project. The data look like this:

    {"id":"1687","name":"Hawaii Caucus - DEM","notes":"","suppress_timestamp":"0","header":["Outcome","PredictWise","Derived Betfair Price","Betfair Back","Betfair Lay","Pollster","Derived PredictIt"],"default_sort":"2","default_sort_dir":"desc","shade_cols":["1"],"history":[{"timestamp":"03-17-2016 1:03PM","table":[["Hillary Clinton","43 %",null,null,null,null,"$ 0.425"],["Bernie Sanders","57 %",null,null,null,null,"$ 0.570"]]},...
    

    Open the Chrome debugger on that website and go to the Network tab. From there, look for requests for "table_xxxx.json". You can see the URL used to request the data, and the raw data returned from the server.

    Hope this helps!
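
To make the Network-tab approach a bit more concrete, here is a minimal Python sketch of how a response shaped like the excerpt above might be flattened into one record per table row. The function names (`fetch_table`, `flatten_history`, `parse_percent`) are my own and purely illustrative, and the URL passed to `fetch_table` would be whatever you copy from the Network tab; I am not assuming any particular address. The sample string reuses the first history entry from the excerpt (a few top-level keys are left out for brevity), with closing brackets added so that it parses; the real response continues past that point.

```python
import json
from urllib.request import urlopen


def fetch_table(url):
    """Download and decode one JSON document.

    `url` is an assumption: pass the request URL copied from the Network tab.
    """
    with urlopen(url) as response:
        return json.load(response)


def flatten_history(payload):
    """Turn the nested "history" list into one flat dict per table row."""
    header = payload["header"]
    records = []
    for snapshot in payload.get("history", []):
        for row in snapshot["table"]:
            # Pair each column name with the value in the same position.
            record = dict(zip(header, row))
            record["timestamp"] = snapshot["timestamp"]
            records.append(record)
    return records


def parse_percent(text):
    """Convert a string such as "43 %" to 43.0; pass None through unchanged."""
    if text is None:
        return None
    return float(text.replace("%", "").strip())


# First history entry from the excerpt above, closed off by hand so it is
# valid JSON. The real response contains more keys and more history entries.
SAMPLE = (
    '{"id":"1687","name":"Hawaii Caucus - DEM",'
    '"header":["Outcome","PredictWise","Derived Betfair Price",'
    '"Betfair Back","Betfair Lay","Pollster","Derived PredictIt"],'
    '"history":[{"timestamp":"03-17-2016 1:03PM","table":['
    '["Hillary Clinton","43 %",null,null,null,null,"$ 0.425"],'
    '["Bernie Sanders","57 %",null,null,null,null,"$ 0.570"]]}]}'
)

records = flatten_history(json.loads(SAMPLE))
for r in records:
    print(r["timestamp"], r["Outcome"], parse_percent(r["PredictWise"]))
```

Against the live site you would replace `json.loads(SAMPLE)` with a call to `fetch_table` using the copied URL. Keeping the parsing separate from the download means you can test it on a saved response before making any requests.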