
Partial extraction with Jsoup from URL


I am trying to extract all the HTML from a URL with Jsoup, but when I check the result, my Document contains only part of the HTML. Could you help me solve the issue? Below is the code I used:

    Document doc = Jsoup.connect("https://www.diretta.it/").get();
    System.out.println(doc);

The output starts from:

...
var leftMenuEnvironment = {"trans":{"TRANS_DC_INCIDENT_SUBTYPE_31":"ERS","TRANS_DC_INCIDENT_SUBTYPE_32":"Iniezione","TRANS_DC_INCIDENT_SUBTYPE_33":"
...

and not from:

<body class="responsive background-add-off isWide soccer _fs flat pid_400 mgc oneLineLayout isSportPage fcp-skeleton light-bg-1 v3 bg3 seoTopWrapperHidden theme--dark tablet_ad">
<div class="otPlaceholder otPlaceholder--hidden">
...

Solution

  • Your code is OK; the problem is with your IDE. The HTML is over 170 KB, and when you print that much to the IDE console, only the end of the output stays visible. Try saving it to a file, or print just part of it:

    String html = doc.html();
    // Guard against pages shorter than 500 characters
    String start = html.substring(0, Math.min(500, html.length()));
    System.out.println(start);
    

    and you'll see the beginning of the HTML.
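
    For the save-to-file route, a minimal sketch could look like the one below. The class name `HtmlDump`, the helper names `saveHtml` and `head`, and the temp-file location are made up for illustration, and a generated stand-in string takes the place of `doc.html()` so the example runs without Jsoup or a network connection.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlDump {

    // Writes the whole HTML string to a file as UTF-8, so nothing
    // depends on how much output the IDE console keeps.
    public static Path saveHtml(String html, Path target) {
        try {
            return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns at most maxChars characters from the start of the HTML.
    public static String head(String html, int maxChars) {
        return html.substring(0, Math.min(maxChars, html.length()));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real page (assumption). With Jsoup on the classpath:
        //   String html = Jsoup.connect("https://www.diretta.it/").get().html();
        StringBuilder sb = new StringBuilder("<html><head></head><body>");
        for (int i = 0; i < 20000; i++) {
            sb.append("<p>row ").append(i).append("</p>");
        }
        String html = sb.append("</body></html>").toString();

        // Hypothetical output location; any writable path works.
        Path target = Files.createTempFile("page", ".html");
        saveHtml(html, target);

        System.out.println("Saved " + html.length() + " chars to " + target);
        System.out.println(head(html, 500));
    }
}
```

    Opening the saved file in an editor or browser shows the whole document, including the `<body ...>` tag that scrolled out of the console.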