I'm currently trying to gather data from CS:GO gambling sites to analyze them. I wrote a very short program that extracts the HTML from this website, but it doesn't extract the content of the web app, and that content is exactly what I need. I can view it in Chrome, so I assume there is a solution. Maybe the pictures help to understand what I'm looking for:
HTML code; marked the line I want
import java.io.IOException;
import org.jsoup.Jsoup;

public class Main {
    public static void main(String[] args) {
        try {
            // Fetches only the static HTML; scripts on the page are not executed.
            String html = Jsoup.connect("https://www.wtfskins.com/crash").get().html();
            System.out.println(html);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
So that's what I get. I need the content of app-root, but it only contains the loading placeholder:
<body> <app-root>
loading... // That's the problem
</app-root>
<script src="https://code.jquery.com/jquery-3.1.1.min.js" integrity="sha256-hVVnYaiADRTO2PzUGmuLJr8BLUSjGIZsDYGmIJLv2b8=" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/tether/1.4.0/js/tether.min.js" integrity="sha384-DztdAPBWPRXSA/3eYEEUWrWCy7G5KFbe8fFjk5JAIxUYHKkDx6Qin1DkWx51bBrb" crossorigin="anonymous"></script>
<script src="/assets/js/jquery-ui.min.js"></script>
<script src="/assets/js/bootstrap.js"></script>
<script src="/assets/js/sha3.js"></script>
<script src="/assets/js/sha256.js"></script>
<script type="text/javascript" src="inline.318b50c57b4eba3d437b.bundle.js"></script>
<script type="text/javascript" src="polyfills.2b75d68d2d6cb678fc8d.bundle.js"></script>
<script type="text/javascript" src="main.7932c68952979c366236.bundle.js"></script>
</body>
The data is loaded into the page after the initial DOM is built.
When you fetch the page with Jsoup, you only get the response to the initial HTML request. Jsoup does not execute JavaScript, so app-root stays at its placeholder.
If you check the Network tab in the browser's dev tools, you will see that extra XHR requests fetch the data after the initial load.
The ngcontent attributes on the tags indicate that the page is rendered with Angular, which is a JavaScript framework.
This is done to make page loads more efficient, and it also makes scraping a bit harder.
The Network tab shows multiple requests after the page load that return JSON responses. You need to look at those and work out which request headers are mandatory for them to succeed. As the image shows, one of the interesting ones is: https://www.wtfskins.com/api/v1/p2ptrading/usertrades/
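As a rough sketch of what requesting such an endpoint directly could look like, here is a small example using the JDK's built-in HttpClient (Java 11+) instead of Jsoup. The class name ApiFetch, the helper names, and the header values are my own assumptions for illustration. I have not confirmed which headers or cookies this site actually requires, so check the request in the Network tab and adjust.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ApiFetch {
    // Endpoint named above. Whether it answers without cookies or extra
    // headers is an assumption that has to be verified in the Network tab.
    static final String TRADES_URL =
            "https://www.wtfskins.com/api/v1/p2ptrading/usertrades/";

    // Builds a GET request that asks for JSON. The header values are
    // illustrative guesses, not values confirmed for this site.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body as a string.
    static String fetchJson(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Prints the raw JSON text; parsing it is left to a JSON library.
        System.out.println(fetchJson(TRADES_URL));
    }
}
```

If you would rather stay with Jsoup, the same request should work through Jsoup.connect(url).ignoreContentType(true).execute().body(), since Jsoup refuses non-HTML content types by default.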
You can start by reading "How the Web works", along with its subtopics on asynchronous JavaScript requests and REST API basics. If you are not familiar with web development, the research will take a bit of time.