
Jsoup - hidden div class?


I'm trying to scrape a div class, but everything I have tried has failed so far :(

I'm trying to scrape the element(s):

<a href="http://www.bellator.com/events/d306b5/bellator-newcastle-pitbull-vs-scope"><div class="s_buttons_button s_buttons_buttonAlt s_buttons_buttonSlashBack">More info</div></a>

from the website: http://www.bellator.com/events

I tried accessing the list of elements by doing

Elements elements = document.select("div[class=s_container] > li");

but that didn't return anything.

Then I tried accessing just the parent with

Elements elements = document.select("div[class=s_container]");

and that returned two divs with the class name "s_container", none of which is the one I needed :<

Then I tried accessing that one's parent with

Elements elements = document.select("div[class=ent_m152_bellator module ent_m152_bellator_V1_1_0 ent_m152]");

And that didn't return anything.

I also tried

Elements elements = document.select("div[class=ent_m152_bellator]");

because I wasn't sure about the whitespace, but it didn't return anything either.

Then I tried accessing its parent by

Elements elements = document.select("div#t3_lc");

and that worked, but it returned an element containing

<div id="t3_lc"> 
<div class="triforce-module" id="t3_lc_promo1"></div> 
</div>

which is kinda weird, because I can't see that it has that child when I inspect the website in Chrome :S

Does anyone know what's going on? I feel kinda lost.


Solution

  • What you see in your web browser is not what Jsoup sees. Jsoup does not execute JavaScript, so it only gets the original HTML document. To see what Jsoup gets, disable JavaScript and refresh the page, or press CTRL+U ("Show source", not "Inspect"!) in your browser to see the original HTML before JavaScript modifies it. Your browser's debugger shows the final document after those modifications, so it's not suitable for your needs.

    It seems like the whole "UPCOMING EVENTS" section is loaded dynamically by JavaScript, and asynchronously at that, via AJAX. You can use your browser's debugger (Network tab) to see every request and response.


    I found it, but unfortunately all the data you need is returned as JSON, so you're going to need another library to parse it.

    That's not the end of the bad news; this case is more complicated. You could request the data directly from http://www.bellator.com/feeds/ent_m152_bellator/V1_1_0/d10a728c-547e-4a6f-b140-7eecb67cff6b but the URL looks random, and a few of these URLs (one per upcoming event?) are embedded in the JavaScript code inside the HTML.


    My approach would be to get the URLs of these feeds with something like:

    
            List<String> feedUrls = new ArrayList<>();

            // select all the scripts
            Elements scripts = document.select("script");
            for (Element script : scripts) {
                // use data(), not text(): Jsoup keeps script content as data, so text() comes back empty
                if (script.data().contains("http://www.bellator.com/feeds/")) {
                    // here use a regexp to get all URLs from script.data() and add them to feedUrls

                }
            }

            for (String feedUrl : feedUrls) {
                // iterate over feed URLs, download each of them;
                // execute().body() returns the raw response text instead of parsing it as HTML
                String json = Jsoup.connect(feedUrl).ignoreContentType(true).execute().body();
                // here use a JSON parsing library to get the data you need

            }
    

    An alternative approach would be to stop using Jsoup because of its limitations and use Selenium WebDriver instead. It drives a real browser that executes JavaScript, so you'd get the HTML of the final result, exactly what you see in the web browser and Inspector.
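
For the "use regexp to get all URLs" step in the Jsoup approach, here is a minimal sketch using only java.util.regex. The class name FeedUrlExtractor, the method extractFeedUrls and the sample script text are made up for illustration. It also assumes the feed URLs appear unescaped inside the script source; if the page writes them as http:\/\/... the pattern would need adjusting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: pulls feed URLs out of raw script text.
class FeedUrlExtractor {

    // Matches the feed prefix, then everything up to the next quote,
    // whitespace or backslash (assumption: URLs are not escaped).
    private static final Pattern FEED_URL = Pattern.compile(
            "http://www\\.bellator\\.com/feeds/[^\"'\\s\\\\]+");

    public static List<String> extractFeedUrls(String scriptSource) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = FEED_URL.matcher(scriptSource);
        while (matcher.find()) {
            String url = matcher.group();
            // skip duplicates so each feed is downloaded only once
            if (!urls.contains(url)) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up script content; the real page may lay this out differently.
        String sample = "var feeds = [\"http://www.bellator.com/feeds/"
                + "ent_m152_bellator/V1_1_0/d10a728c-547e-4a6f-b140-7eecb67cff6b\"];";
        for (String url : extractFeedUrls(sample)) {
            System.out.println(url);
        }
    }
}
```

Inside the loop over scripts you would then call something like feedUrls.addAll(FeedUrlExtractor.extractFeedUrls(script.data())).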