Tags: java, android-studio, web-scraping, jsoup, html-parsing

Going to next page when web scraping with Jsoup


I am trying to scrape https://www.actksa.com/ar/training-courses/training-in/Jeddah with Jsoup, but the code I wrote only collects the subjects on the first page.

    try {
        String url = "https://www.actksa.com/ar/training-courses/training-in/Jeddah";

        Document doc = Jsoup.connect(url).get();

        Elements data = doc.select("tr");
        int size = data.size();
        Log.d("doc", "doc: " + doc);
        Log.d("data", "data: " + data);
        Log.d("size", "" + size);
        for (int i = 0; i < size; i++) {
            // Title text of the i-th course cell
            String title = data.select("td.wp-60")
                    .eq(i)
                    .text();
            // Link to the i-th course's detail page
            String detailUrl = data.select("td.wp-60")
                    .select("a")
                    .eq(i)
                    .attr("href");
            parseItems.add(new ParseItem(title, detailUrl));
            Log.d("items", " . title: " + title);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

How can I continue scraping subjects from the following pages? I noticed that the site uses pagination, but I am not sure how to work with it. I also noticed that the link changes slightly when going to the next page, so maybe I could use that. What code would go to the next pages and continue scraping titles?


Solution

  • It seems the pagination for that site is controlled by the ?page=<int> query parameter. Simply wrap your existing code in a for loop that will control the current page.

    int numPages = 5; // the number of pages to scrape
    // Assumption: pages are numbered from 1, so 1..numPages covers the last page too
    for (int i = 1; i <= numPages; i++) {
        String url = "https://www.actksa.com/ar/training-courses/training-in/Jeddah?page=" + i;
    
        Document doc = Jsoup.connect(url).get();
    
        Elements data = doc.select("tr");
        int size = data.size();
        Log.d("doc", "doc: "+doc);
        Log.d("data", "data: "+data);
        Log.d("size", ""+size);
        for (int j = 0; j < size; j++) {
             String title = data.select("td.wp-60")
                    .eq(j)
                    .text();
             String detailUrl = data.select("td.wp-60")
                    .select("a")
                    .eq(j)
                    .attr("href");
            parseItems.add(new ParseItem(title, detailUrl));
            Log.d("items"," . title: " + title);
        }
    }
    

    If you want to scrape every page without hardcoding the page count, increment the page number in a while loop and break out of the loop when the table on the page has no contents. For example, https://www.actksa.com/ar/training-courses/training-in/jeddah?page=6 is past the last page and just shows a page with an empty table.
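
The while-loop idea can be sketched roughly as follows. This is only an illustration of the stopping rule, not tested against the site: `PaginationSketch`, `collectAllPages`, `fetchPage`, and `maxPages` are hypothetical names, the sketch assumes page numbers start at 1, and the page fetcher is passed in as a function so the loop can be run without a network connection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

final class PaginationSketch {

    // Collects items page by page, starting at page 1 (an assumption about the site),
    // and stops at the first page that returns no items. maxPages is a safety cap
    // so the loop cannot run forever if a site never returns an empty page.
    static List<String> collectAllPages(IntFunction<List<String>> fetchPage, int maxPages) {
        List<String> all = new ArrayList<>();
        int page = 1;
        while (page <= maxPages) {
            List<String> items = fetchPage.apply(page);
            if (items.isEmpty()) {
                break; // empty table: we are past the last page
            }
            all.addAll(items);
            page++;
        }
        return all;
    }

    public static void main(String[] args) {
        // Stand-in for the real site: three pages with two titles each, then empty pages.
        IntFunction<List<String>> fakeSite = page -> page <= 3
                ? Arrays.asList("course " + page + "a", "course " + page + "b")
                : Collections.<String>emptyList();
        System.out.println(collectAllPages(fakeSite, 50));
    }
}
```

In the real scraper, `fetchPage` would presumably build the URL with `?page=` plus the page number, load it with `Jsoup.connect(url).get()`, and return the text of the `td.wp-60` cells, handling the `IOException` that `get()` can throw.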