How to Load Entire Contents of HTML - Jsoup


I was trying to download HTML table rows using Jsoup, but it parses only part of the HTML content. I also tried the code below to load the full HTML content, but it doesn't work. Any suggestions would be appreciated.

import java.io.FileWriter;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AmfiDaily {
    public static void main(String[] args) {
        AmfiDaily amfiDaily = new AmfiDaily();

        amfiDaily.extractAmfiTable("https://www.amfiindia.com/intermediary/other-data/transaction-in-debt-and-money-market-securities");
    }

    public void extractAmfiTable(String url) {
        String csvPath = "D:\\FTRACK\\Amfi Report " + java.time.LocalDate.now() + ".csv";

        // try-with-resources closes the writer even if the request fails
        try (FileWriter writer = new FileWriter(csvPath)) {
            Document document = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                    .maxBodySize(0)        // 0 = no limit on the response size
                    .timeout(100000 * 5)   // milliseconds
                    .get();

            Elements rows = document.select("tr");

            for (Element row : rows) {
                Elements cells = row.select("td");

                for (Element cell : cells) {
                    String text = cell.text();

                    if (text.contains(",")) {
                        // Quote cells that contain a comma so they stay in one CSV column
                        writer.write("\"" + text.replace("\"", "\"\"") + "\",");
                    } else {
                        writer.write(text.concat(","));
                    }
                }

                writer.write("\n");
            }
        } catch (IOException e) {
            // Print the stack trace instead of silently discarding it
            e.printStackTrace();
        }
    }
}

Solution

  • Disable JavaScript in your browser to see exactly what Jsoup sees. Jsoup does not execute JavaScript, so the part of the page that is loaded with AJAX never reaches it. There is an easy way to check where the additional data comes from.

    Open the Network tab of your browser's developer tools and look at the requests and responses.

    You can see that the table is downloaded from this URL: https://www.amfiindia.com/modules/LoadModules/MoneyMarketSecurities

    You can request this URL directly to get the data you need.

    To overcome Jsoup's limitation and load the whole HTML at once, you can use Selenium WebDriver. There is an example here: https://stackoverflow.com/a/54510107/9889778
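
Requesting the table URL directly, as suggested above, might look roughly like the sketch below. It assumes the endpoint still returns the table markup to a plain GET request and that the jsoup library is on the classpath. The class name `AmfiModuleFetch`, the helper `tableToCsv`, the shortened user agent, and the 60 second timeout are illustrative choices, not part of the original code.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AmfiModuleFetch {
    // URL found in the browser's Network tab (see the answer above)
    static final String MODULE_URL =
            "https://www.amfiindia.com/modules/LoadModules/MoneyMarketSecurities";

    // Hypothetical helper: turns every <tr> of the document into one CSV line
    static String tableToCsv(Document doc) {
        StringBuilder csv = new StringBuilder();
        for (Element row : doc.select("tr")) {
            StringBuilder line = new StringBuilder();
            for (Element cell : row.select("th, td")) {
                if (line.length() > 0) {
                    line.append(',');
                }
                // Quote every cell so a comma inside a value does not split the column
                line.append('"').append(cell.text().replace("\"", "\"\"")).append('"');
            }
            // Skip rows that contain no cells at all
            if (line.length() > 0) {
                csv.append(line).append('\n');
            }
        }
        return csv.toString();
    }

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect(MODULE_URL)
                .userAgent("Mozilla/5.0")  // assumed; any browser-like value
                .maxBodySize(0)            // 0 = no limit on the response size
                .timeout(60_000)           // milliseconds
                .get();
        System.out.print(tableToCsv(doc));
    }
}
```

Keeping the parsing in `tableToCsv` separate from the network call makes it possible to try the conversion on a small HTML string with `Jsoup.parse` before pointing it at the live site.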