JSoup doesn't retrieve JSON data from script tag


I'm trying to read the content of a script tag (JSON data) from a recipe page using JSoup 1.13.1. I won't post the HTML here because the script tag's content is very large.

Whenever I try to print the content, I get an empty string. I tried selecting the element in different ways: by ID with doc.select("#__NEXT_DATA__"), and by type with doc.select("script[type='application/json']").

If I iterate over all the script tags, the one I want prints as blank. I also tried printing the content with the text() and toString() methods, but neither works. I even saw a suggestion to set maxBodySize(0), but that didn't help either.

Here is my code:

String url = "https://www.marmiton.org/recettes/recette_gateau-au-chocolat-fondant-rapide_166352.aspx";
// maxBodySize(0) removes the limit on the downloaded body size
Document doc = Jsoup.connect(url).maxBodySize(0).get();

Elements newsHeadlines = doc.select("#__NEXT_DATA__");
                    
for (Element element : newsHeadlines) {
    System.out.println(element);
}

Solution

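  • Treat the script element as data. JSoup stores the contents of a script tag as data rather than text, so read it with data() instead of text():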

    Elements newsHeadlines = doc.select("#__NEXT_DATA__");
    
    for (Element element : newsHeadlines) {
        System.out.println(element.data());
    }
    

    Note that some consoles may have trouble displaying a single line of 81206 characters (Eclipse did for me, though something in the data itself may have been the cause), so the following code prints only the length and the first 100 characters:

        for (Element element : newsHeadlines) {
            System.out.println(element.data().length());
            
            int printLen = Math.min(100, element.data().length());
            System.out.println(element.data().substring(0,printLen));
        }
    

    And produces:

    81206
    {"props":{"pageProps":{"recipeData":{"recipe":{"id":166352,"guid":"7bf48b95-4cd2-4b32-8f41-fb6168510
    

    Note: if you can use a debugger in your environment, it shows that the element held the result all along, as a child node of type DataNode. That is the first clue that the content has to be read as data.
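
The difference between text() and data() can be reproduced without hitting the network. The following is a minimal sketch, assuming jsoup is on the classpath; the class name ScriptDataDemo, the helper scriptData, and the inline HTML string are made up for illustration and are not part of the original page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScriptDataDemo {

    // Hypothetical helper: returns the raw contents of the first element
    // matching cssQuery, or an empty string if nothing matches.
    static String scriptData(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element script = doc.selectFirst(cssQuery);
        return script == null ? "" : script.data();
    }

    public static void main(String[] args) {
        // Small stand-in for the real page (assumption: same id and type attributes)
        String html = "<html><body>"
                + "<script id=\"__NEXT_DATA__\" type=\"application/json\">"
                + "{\"props\":{\"id\":166352}}"
                + "</script>"
                + "</body></html>";

        Element script = Jsoup.parse(html).selectFirst("#__NEXT_DATA__");

        // text() only collects text nodes, so it is empty for a script element
        System.out.println("text(): [" + script.text() + "]");

        // data() collects the DataNode children, which hold the JSON
        System.out.println("data(): [" + script.data() + "]");
    }
}
```

The string returned by data() can then be handed to whichever JSON parser the project already uses.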