Tags: java, javafx, css-selectors, jsoup, javafx-webengine

Webpage content selector using WebEngine


I want to load web content from a site URI and use selectors to extract some useful information. I tried Jsoup, which lets me select elements with CSS selectors. Unfortunately, Jsoup does not work like a browser: it does not execute JavaScript or handle cookies. That is why I looked into the JavaFX WebEngine, which does work like a browser. However, the WebEngine class returns Document objects, which are very limited in selector possibilities compared to Jsoup: the only lookups available are by ID or by tag name.

Is there a clean way to use the WebEngine of JavaFX with more specialized selector possibilities?

Or are there other browser implementations in Java that allow for more specialized selections? The implementation should preferably be fast.

The best solution I can come up with for now is the following:

  1. Use the JavaFX WebEngine to load the page and obtain a Document object after the page's JavaScript has run.
  2. Convert the Document to a String using a Transformer.
  3. Parse this String with Jsoup and use its CSS selector capabilities.
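
Step 2 could look roughly like the sketch below. It assumes the Document is the org.w3c.dom.Document that WebEngine.getDocument() returns once the page has finished loading. The class name DomSerializer and the sample markup in main are made up for illustration, and only JDK classes are used.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: the name is illustrative, not part of JavaFX or Jsoup.
final class DomSerializer {

    /** Serializes a W3C DOM Document to an HTML string using the JDK's Transformer. */
    static String toHtml(Document doc) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Ask for HTML output rather than the default XML serialization.
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for webEngine.getDocument(): a tiny DOM built by hand.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setAttribute("id", "price");
        p.setTextContent("42");
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);

        System.out.println(toHtml(doc));
    }
}
```

For step 3, the returned string can be handed to Jsoup.parse(html), and the resulting Jsoup document accepts CSS selectors through select(...).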

Solution

  • Jsoup does support cookies. You need to collect them from each response (Connection.Response.cookies()) and send them along with every subsequent request (Connection.cookies(...)). It takes some work, but it is possible.

    Your solution will work, but I doubt that the JavaFX WebEngine is your best option, unless your application uses JavaFX anyway and also needs to display web content. If you need it only for the task you described, I would recommend Selenium WebDriver instead. With it you can remote-control a real browser and access all of the content. There are bindings for many standard browsers, including PhantomJS as a headless WebKit solution for maximum compatibility and HtmlUnit as a Java-only solution.

    However, if speed is a major concern, I would give Jsoup another try. Try to find the AJAX calls that the JavaScript triggers and request the data you need directly. This would be much faster than Selenium or WebEngine.
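
The idea of calling the AJAX endpoint directly could be sketched as follows. To stay dependency-free, this sketch uses the JDK's java.net.http.HttpClient (Java 11+) in place of Jsoup. The class name AjaxFetcher is made up, the endpoint URL has to be discovered by watching the browser's network tab, and the X-Requested-With header is only an assumption about what a given server expects.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: fetches the data an in-page script would normally request.
final class AjaxFetcher {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    /** Returns the raw response body of the given endpoint, or throws on a non-200 status. */
    static String fetchBody(URI endpoint) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                // Assumption: some servers only answer requests that look like XHR calls.
                .header("X-Requested-With", "XMLHttpRequest")
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: AjaxFetcher <endpoint-url>");
            return;
        }
        System.out.println(fetchBody(URI.create(args[0])));
    }
}
```

With Jsoup itself, the rough equivalent is Jsoup.connect(url).ignoreContentType(true).execute().body(), which returns the response body as a string even when it is JSON rather than HTML.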