java · web-scraping · jsoup

Jsoup hyperlink scraping not working for some websites


I've been working on a project that involves scraping specific products from websites and reporting their availability status (graphics cards, if anyone is curious). Using Jsoup, I do this by going through product listing pages, scraping all the links, and filtering out the relevant ones. For some websites my code works completely fine, but for others only some links, or even none at all, are scraped.

Working example:

  1. https://www.bhphotovideo.com/c/buy/Graphic-Cards/ci/6567

Non-Working example:

  1. https://www.bestbuy.com/site/computer-cards-components/video-graphics-cards/abcat0507002.c?id=abcat0507002
  2. https://www.evga.com/products/productlist.aspx?type=0

Here is the snippet of code in charge of scraping the links:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class LinkScrapeLite {

    public static void main(String[] args) {
        try {
            // EVGA gives me no output whatsoever
            Document doc = Jsoup.connect("https://www.evga.com/products/productlist.aspx?type=0").get();

            String title = doc.title();
            System.out.println("title: " + title);

            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get the value from the href attribute
                System.out.println("\nlink: " + link.attr("href"));
                System.out.println("text: " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

I understand that what I'm doing is by no means efficient, so if anyone has suggestions for how I could do this in a better way, please let me know :)
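For context, the filtering step relies on Jsoup's selector syntax. Here is a self-contained sketch against a made-up listing page; the HTML and the `skuId` URL pattern are invented for illustration, but a substring selector like `a[href*=...]` avoids collecting every link and filtering afterwards:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorFilter {
    public static void main(String[] args) {
        // Stand-in for a downloaded listing page; the structure is hypothetical.
        String html = "<html><body>"
                + "<a href=\"/site/rtx-3080.p?skuId=6429440\">RTX 3080</a>"
                + "<a href=\"/help/returns\">Returns</a>"
                + "<a href=\"/site/rtx-3070.p?skuId=6429442\">RTX 3070</a>"
                + "</body></html>";

        Document doc = Jsoup.parse(html);

        // Select only anchors whose href contains "skuId" (product detail
        // pages on this hypothetical site); navigation links are skipped.
        for (Element link : doc.select("a[href*=skuId]")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```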



Solution

  • In this case you need a library that can wait for the page's JavaScript to finish loading; for example, we can use HtmlUnit.

    Here is the solution for the EVGA site:

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.DomElement;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    import java.io.IOException;
    import java.util.List;

    public class HtmlUnitScrape {

        public static void main(String[] args) throws IOException {
            String url = "https://www.evga.com/products/productlist.aspx?type=0";

            try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
                // The page's own scripts may throw errors we don't care about,
                // so tell HtmlUnit not to fail on them.
                webClient.getOptions().setThrowExceptionOnScriptError(false);
                webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
                webClient.getOptions().setPrintContentOnFailingStatusCode(false);

                HtmlPage htmlPage = webClient.getPage(url);
                // Give background JavaScript up to a second to finish rendering.
                webClient.waitForBackgroundJavaScript(1000);

                // getByXPath returns List<?>, so cast each node individually.
                final List<?> anchors = htmlPage.getByXPath("//a");
                for (Object node : anchors) {
                    DomElement element = (DomElement) node;
                    System.out.println(element.getAttribute("href"));
                }
            }
        }
    }
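Once HtmlUnit has rendered the page, you can also hand the resulting markup back to Jsoup and keep using the same selector syntax as in the original code. A minimal sketch of that handoff; here the `rendered` string is a literal standing in for what `htmlPage.asXml()` would return after the scripts have run:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RenderedToJsoup {
    public static void main(String[] args) {
        // In the real flow this string would come from htmlPage.asXml(),
        // i.e. the DOM *after* HtmlUnit has executed the page's JavaScript.
        String rendered = "<html><body>"
                + "<a href=\"/products/product.aspx?pn=1\">EVGA Card</a>"
                + "</body></html>";

        // Passing a base URI lets absUrl() resolve relative hrefs.
        Document doc = Jsoup.parse(rendered, "https://www.evga.com/");
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.absUrl("href"));
        }
    }
}
```

This keeps HtmlUnit responsible only for JavaScript execution, while all the link selection and filtering logic stays in one place.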