java html web-scraping automation jsoup

How to parse tabular data from CNBC Markets Page?


I am writing a program that takes user input to connect to a site, downloads its HTML into a text file, and retrieves data from a table twice a day. I understand the code will not be one-size-fits-all for every page (I will likely hardwire the URL into the code once I get it working). My issue at present is that my jsoup parser isn't reading in the tabular data properly. Are my element selectors too generic? The table looks like it is in standard table/tr/td format, but my rows collection comes back with size 0. If someone could help me debug my parser, and possibly suggest where to look for making it grab data silently twice a day, I'd really appreciate it! There are no runtime or compile errors; I just need correct output.

Source site: https://www.cnbc.com/us-markets/

Source code for table (snippet):

<table class="BasicTable-table"><thead class="BasicTable-tableHeading BasicTable-tableHeadingSortable"><tr><th class="BasicTable-textData"><span>SYMBOL <span class="icon-sort undefined"></span></span></th><th class="BasicTable-numData"><span>PRICE <span class="icon-sort undefined"></span></span></th><th class="BasicTable-numData">

My code:

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class StockScraper {

    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        System.out.println("Enter the complete url (including http://) of the site you would like to parse:");
        String url = input.nextLine();
        try {
            Document doc = Jsoup.connect(url).get();
            System.out.printf("Title: %s%n", doc.title());

            // Save html contents to text file (closed automatically)
            System.out.println("Writing html contents to 'html.txt'...");
            try (PrintWriter outputfile = new PrintWriter("html.txt")) {
                outputfile.print(doc.outerHtml());
            }

            // Select stock data you want to retrieve
            System.out.println("Enter the name of the stock you want to check");
            String name = input.nextLine();

            // Pull data from CNBC Markets
            Element table = doc.select("table").get(0);
            Elements rows = table.select("tr");
            System.out.println(rows.size());
            // Start at 1 to skip the header row
            for (int i = 1; i < rows.size(); i++) {
                Element row = rows.get(i);
                // Select the cells of the current row, not of the whole row set
                Elements cols = row.select("td");
                // Compare the cell's text (not the Element itself) with the name
                if (cols.size() > 1 && cols.get(0).text().equals(name)) {
                    System.out.println("I worked!");
                    System.out.println(cols.get(1).text());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Solution

  • The problem here is that this site is a dynamic page that loads its content after the browser initially downloads the page. Jsoup only parses the HTML the server returns and does not execute JavaScript, so it never sees the table and is not adequate for scraping pages like this. You have a couple of options:

    1) Use a tool that simulates a browser and makes all the necessary API calls. Two options are Selenium WebDriver and HtmlUnit.

    2) Figure out which API calls on this site you are interested in, and call those APIs directly to get a JSON document you can parse. You can see the API details by opening the developer tools in your browser and looking at the Network tab. For this site, an example is the following request, which includes the stock quote for DJI:

    https://quote.cnbc.com/quote-html-webservice/quote.htm?noform=1&partnerId=2&fund=1&exthrs=0&output=json&symbolType=issue&symbols=599362|579435|593933|49020635|49031016|5093160|617254|601065&requestMethod=extended
    
    Returns:
    
    ExtendedQuoteResult: {
      xmlns: "http://quote.cnbc.com/services/MultiQuote/2006",
      ExtendedQuote: [{
        QuickQuote: {
          symbol: ".DJI",
          code: "0",
          curmktstatus: "REG_MKT",
          FundamentalData: {
            yrlodate: "2020-03-23",
            yrloprice: "18213.65",
            yrhidate: "2020-02-12",
            yrhiprice: "29568.57"
          },
          mappedSymbol: {
            xsi:nil: "true"
          },
          source: "Exchange",
          cnbcId: "599362",
          prev_prev_closing: "21413.44",
          high: "22783.45",
          low: "21693.63",
          provider: "CNBC Quote Cache",
          streamable: "0",
          last_time: "2020-04-06T17:16:28.000-0400",
          countryCode: "US",
          previous_day_closing: "21052.53",
          altName: "Dow Industrials",
          reg_last_time: "2020-04-06T17:16:28.000-0400",
          last_time_msec: "1586207788000",
          altSymbol: ".DJI",
          change_pct: "7.73",
          providerSymbol: ".DJI",
          assetSubType: "Index",
          comments: "RIC",
          last: "22679.99",
          issue_id: "599362",
          cacheServed: "false",
          responseTime: "Mon Apr 06 19:12:09 EDT 2020",
          change: "1627.46",
          timeZone: "EDT",
          onAirName: "Dow Industrials",
          symbolType: "issue",
          assetType: "INDEX",
          volume: "614200990",
          fullVolume: "614200990",
          realTime: "true",
          name: "Dow Jones Industrial Average",
          quoteDesc: { },
          exchange: "Dow Jones Global Indexes",
          shortName: "DJIA",
          cachedTime: "Mon Apr 06 19:12:09 EDT 2020",
          currencyCode: "USD",
          open: "21693.63"
        }
      }
    ...
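
To make option 2 a bit more concrete, here is a minimal sketch that uses only the JDK (Java 11 or later). Treat it as an illustration rather than a finished scraper. The class and method names (CnbcQuoteSketch, buildUrl, fetch, extractField) are made up for this example. The regex lookup is only a dependency-free stand-in; a real JSON library such as Jackson or Gson would be more robust. It also assumes the endpoint still answers as shown above and that the raw response is ordinary JSON with quoted keys (the pattern tolerates unquoted keys as well, as in the listing above).

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; none of these names come from CNBC or jsoup.
final class CnbcQuoteSketch {

    // Endpoint and fixed parameters copied from the request shown above.
    private static final String BASE =
            "https://quote.cnbc.com/quote-html-webservice/quote.htm"
            + "?noform=1&partnerId=2&fund=1&exthrs=0&output=json&symbolType=issue";

    /** Builds the request URL. '|' is not legal in a java.net.URI, so it is percent-encoded. */
    static String buildUrl(String... issueIds) {
        String symbols = URLEncoder.encode(String.join("|", issueIds), StandardCharsets.UTF_8);
        return BASE + "&symbols=" + symbols + "&requestMethod=extended";
    }

    /** Downloads the response body as text. Needs network access; not called from main. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Returns the first string value stored under exactly this key, if there is one. */
    static Optional<String> extractField(String json, String field) {
        // The lookbehind keeps "change" from matching inside "exchange";
        // requiring ':' right after the key keeps "last" from matching "last_time".
        Pattern p = Pattern.compile(
                "(?<![A-Za-z0-9_])\"?" + Pattern.quote(field) + "\"?\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Offline sample shaped like the response above, so this runs without network access.
        String sample = "{\"QuickQuote\":{\"symbol\":\".DJI\","
                + "\"last_time\":\"2020-04-06T17:16:28.000-0400\",\"last\":\"22679.99\"}}";
        System.out.println(buildUrl("599362", "579435"));
        System.out.println(extractField(sample, "symbol").orElse("n/a")
                + " last = " + extractField(sample, "last").orElse("n/a"));
    }
}
```

Against the live service you would call extractField(fetch(buildUrl("599362")), "last") instead of using the embedded sample. Note that extractField only returns the first match, so with several symbols in one request you would want proper JSON parsing to walk the ExtendedQuote array.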