java, selenium-webdriver, web-crawler, jsoup

Scraping full dynamic HTML content using Java JSoup and Selenium


I am trying to scrape this website

https://www.dailystrength.org/search?query=aspirin&type=discussion

to gain a data-set for a project I have (using aspirin as placeholder search item).

I have decided to use Jsoup to build a crawler. The problem is that the posts are loaded dynamically via an Ajax request, which is triggered by the "Show more" button.

[Screenshot: the "Show more" button that causes the problem]

When the entire content is loaded, the page shows the text "All Messages Loaded":

[Screenshot: end result with all messages loaded]

import java.io.IOException;
import java.util.ArrayList;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.*;
import org.openqa.selenium.chrome.*;

/**
 *
 * @author Ahmed
 */
public class Crawler {

    public static void main(String[] args) {
        Document search_result;
        // search terms; "aspirin" is a placeholder for the project
        String[] requested = new String[]{"aspirin"/*, "Fentanyl"*/};
        // Newsfeed_item is a simple holder class (subject, description, replysLink)
        ArrayList<Newsfeed_item> threads = new ArrayList<>();

        String query = "https://www.dailystrength.org/search?query=";

        try {
            for (int i = 0; i < requested.length; i++) {
                search_result = Jsoup.connect(query + requested[i] + "&type=discussion").get();

                // each search hit is wrapped in an element with class "newsfeed__item"
                Elements posts = search_result.getElementsByClass("newsfeed__item");
                for (Element item : posts) {

                    Elements link = item.getElementsByClass("newsfeed__btn-container posts__discuss-btn");

                    Newsfeed_item currentItem = new Newsfeed_item();
                    currentItem.replysLink = link.attr("abs:href");
                    // follow the discussion link and collect the post bodies
                    Document reply_result = Jsoup.connect(currentItem.replysLink).get();
                    Elements description = reply_result.getElementsByClass("posts__content");

                    currentItem.description = description.text();
                    currentItem.subject = requested[i];
                    System.out.println(currentItem);
                }
            }
        } catch (IOException ex) {
            Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

This code gives me only the few posts that are shown initially, not the ones loaded by Ajax. I understand that Jsoup alone can't handle this, so I tried to find Selenium resources for loading the full content so I could then crawl it.

I can't find any useful sources; the only code I found for an initial understanding, from

https://www.youtube.com/watch?v=g1IbI_qYsDg

gives me this error:

Exception in thread "main" java.lang.IllegalStateException: The path to the driver executable must be set by the webdriver.gecko.driver system property; for more information, see https://github.com/mozilla/geckodriver. The latest version can be downloaded from https://github.com/mozilla/geckodriver/releases
    at com.google.common.base.Preconditions.checkState(Preconditions.java:847)
    at org.openqa.selenium.remote.service.DriverService.findExecutable(DriverService.java:134)
    at org.openqa.selenium.firefox.GeckoDriverService.access$100(GeckoDriverService.java:44)
    at org.openqa.selenium.firefox.GeckoDriverService$Builder.findDefaultExecutable(GeckoDriverService.java:167)
    at org.openqa.selenium.remote.service.DriverService$Builder.build(DriverService.java:355)
    at org.openqa.selenium.firefox.FirefoxDriver.toExecutor(FirefoxDriver.java:190)
    at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:147)
    at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:125)
    at SeleniumTest.main(SeleniumTest.java:14)
C:\Users\Ahmed\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 0 seconds)
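From what I can tell, that exception just means Selenium can't locate the geckodriver binary; setting the `webdriver.gecko.driver` system property before constructing the driver should fix it. A minimal sketch (the path below is a hypothetical example; with ChromeDriver, which matches the imports above, the property would be `webdriver.chrome.driver` instead):

```java
public class DriverSetup {
    public static void main(String[] args) {
        // Hypothetical path -- point this at wherever geckodriver was unpacked.
        System.setProperty("webdriver.gecko.driver", "C:\\drivers\\geckodriver.exe");

        // With the property set, the driver can be constructed as usual:
        // WebDriver driver = new FirefoxDriver();

        System.out.println(System.getProperty("webdriver.gecko.driver"));
    }
}
```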

Any help, sample code, or alternatives? I just need to get the full page so I can use the crawler I already have, or write a completely new crawler, but I can't find example code and I keep running into errors.


Solution

  • I'd suggest continuing without Selenium. Using your web browser's debugger and its Network tab, you can inspect all the requests your browser sends.

    [Screenshot: the browser's Network tab showing the Ajax request]

    It's useful to look at what happens when you click "Show more". You can see the next page is loaded from this URL: https://www.dailystrength.org/search/ajax?query=aspirin&type=discussion&page=2&_=1549130275261 and you can get further pages by incrementing the page parameter. Unfortunately the result comes back as JSON containing escaped HTML, so you'd have to use a JSON library to parse it, extract the HTML, and then parse that with Jsoup. Conveniently, the JSON also includes a "has_more":true field, so you'd know whether there is more content to fetch.
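A minimal sketch of that paging loop, with the network and JSON-parsing parts stubbed out (the URL format and the "has_more" field come from the observed request above; the trailing "_" timestamp parameter is assumed to be an optional cache-buster, and hasMore below is a naive string check — a real crawler should parse the body with a JSON library such as Gson or Jackson and feed the extracted HTML to Jsoup.parse(...)):

```java
public class AjaxPager {

    // Builds the ajax URL for a given search term and page number.
    static String ajaxUrl(String query, int page) {
        return "https://www.dailystrength.org/search/ajax?query=" + query
                + "&type=discussion&page=" + page;
    }

    // Naive check for the "has_more" flag in the JSON body; a real
    // crawler should use a JSON library instead of string matching.
    static boolean hasMore(String json) {
        return json.contains("\"has_more\":true");
    }

    public static void main(String[] args) {
        // Fetch page after page until the server reports no more content.
        // (The fetch itself is stubbed here; with Jsoup you could use
        // Jsoup.connect(url).ignoreContentType(true).execute().body().)
        int page = 1;
        String body = "{\"html\":\"...\",\"has_more\":false}"; // stubbed response
        while (hasMore(body)) {
            page++;
            // body = fetch(ajaxUrl("aspirin", page));
        }
        System.out.println(ajaxUrl("aspirin", 2));
    }
}
```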