java, javascript, html, jsoup

Workaround for scraping HTML by digging into the JS source code


I learned about jsoup recently and would like to dig deeper into it. However, I have hit an obstacle with web pages that rely on JavaScript (I have no knowledge of JS yet :/).

I have read that HtmlUnit is the right tool for performing web browser actions, but I figured I would not need to know any JS if I could find the JSON object that the page fetches through its JavaScript.

For example, on this page, one of the source files is tooltips.js. In this file, the variable rgNeededFeeds is built and passed to the method LoadHeropediaData(), which builds the full URL for fetching the JSON object.

URL = URL + 'jsfeed/heropediadata?feeds='+strFeeds+'&v=3633666222511362823&l=english';

I cannot figure out what strFeeds actually is. I have tried various combinations, but none of them worked (the request returned an empty array...). Or is my guess totally off?

What I actually need is the data displayed at the top when you click on one of the "items". The info in the hover tooltip would do too, but it lacks the "recipe" info. I am presuming that the JSON object returned by the full URL above contains basically all of that data.

Anyway, this is based only on what I understood from staring at those source files for hours. Do correct me if I'm wrong. (I'm working in Java, by the way.)

P.S. I would also like to take this opportunity to thank BalusC, who has been everywhere whenever I had doubts about jsoup. :>


Solution

  • strFeeds is simply one of these two strings: itemdata or abilitydata

    You can find this in tooltips.js at lines 38-45:

    var rgNeededFeeds = [];
    $.each( [ 'item', 'ability' ], function( i, ttType ) {
        icons = GetIconCollection( ttType );
        if ( icons.length ) {
            rgNeededFeeds.push( ttType+'data' );
            //..............
        }
    } );
    

    ttType takes each value of the array [ 'item', 'ability' ] in turn. It is concatenated with the string 'data', and the result is pushed into the array rgNeededFeeds, but only when GetIconCollection returns a non-empty collection for that type.

    The function LoadHeropediaData is then called, at the end of the function containing the code above, with rgNeededFeeds as its parameter:

    LoadHeropediaData( rgNeededFeeds );
    

    Side note: if you are starting to scrape websites, learning JavaScript will be MANDATORY.

    NOTE: you're right, the JSON contains all the information needed...
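
Since the question is in Java, here is a minimal sketch of how the pieces above could fit together: derive the feed name the way tooltips.js does, rebuild the URL from the question, and download the JSON as plain text. Several things here are assumptions rather than facts from the site: the class name HeropediaFeed and its helper methods are made up for illustration, BASE_URL is a placeholder for whatever the JavaScript variable URL holds on the real page, and the sketch uses the JDK's own java.net.http.HttpClient (Java 11+) instead of jsoup so that it runs without extra dependencies. With jsoup, Jsoup.connect(url).ignoreContentType(true).execute().body() should give the same raw text.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HeropediaFeed {
    // Placeholder (assumption): replace with the value of `URL` used by tooltips.js.
    static final String BASE_URL = "https://example.com/";

    // Mirrors rgNeededFeeds.push( ttType+'data' ): "item" becomes "itemdata".
    static String feedName(String ttType) {
        return ttType + "data";
    }

    // Rebuilds the URL string from the question with strFeeds filled in.
    static String feedUrl(String baseUrl, String strFeeds) {
        return baseUrl + "jsfeed/heropediadata?feeds=" + strFeeds
                + "&v=3633666222511362823&l=english";
    }

    // Downloads the response body as text; parsing the JSON is left to a JSON library.
    static String fetchJson(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String url = feedUrl(BASE_URL, feedName("item"));
        System.out.println(url);
        // Only meaningful once BASE_URL points at the real site:
        // System.out.println(fetchJson(url));
    }
}
```

The URL building is kept separate from the network call so that the string can be checked by eye against the one in tooltips.js before any request is sent.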