Disable JSoup whitespace removal


I'm working on a project where I need to test some HTML output. To do this, I use JSoup to extract the elements I want to test and run assertions against them. The problem is that JSoup 'cleans' the output before returning it, so my outputs do not match my inputs even when the original HTML was correct. Every suggestion I have found recommends disabling pretty print via the output settings, but so far that has not worked. I'm not sure whether I am disabling the output formatting incorrectly or whether something else is going on. Any suggestions would be appreciated.

@Test
public void testJsoupParse()
{
    Document testDoc = Jsoup.parse("<html> <span id='sp1'><strong>ABC  123</strong></span> <span id='sp2'>XYZ 098</span> </html>");
    testDoc.outputSettings().prettyPrint(false);

    String sp1 = testDoc.select("span#sp1").text();
    System.out.println(sp1);
    String spHtml = testDoc.select("span#sp1").html();
    System.out.println(spHtml);
    //this should pass, but fails due to the extra space being stripped out
    assertThat(sp1).isEqualTo("ABC  123");
    //this will also fail since .html() will include the <strong> tags in the output
    assertThat(spHtml).isEqualTo("ABC  123");
    //this will pass
    assertThat(testDoc.select("span#sp2").text()).isEqualTo("XYZ 098");
}

Solution

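  • Use wholeText() instead of text(). text() normalizes whitespace (runs collapse to a single space and the result is trimmed) regardless of the prettyPrint setting, which only affects how html() and outerHtml() serialize markup. wholeText() returns the text without that normalization: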

    String sp1 = testDoc.getElementById("sp1").wholeText();
    assertThat(sp1).isEqualTo("ABC  123");
    

    Side note: for your simple sample HTML it makes no difference, but select is not the right call here. Instead of

    String sp1 = testDoc.select("span#sp1").text();
    

    use selectFirst or getElementById, since you want a single Element. select returns Elements (i.e., a list of elements). So use either of:

    // option 1: CSS selector, first match only
    String sp1 = testDoc.selectFirst("span#sp1").wholeText();
    // option 2: lookup by id
    String sp1 = testDoc.getElementById("sp1").wholeText();
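
To make the difference between the two calls concrete, here is a minimal sketch that does not use jsoup at all, so it runs on its own. The class `WhitespaceDemo` and its methods `normalized` and `whole` are hypothetical names. They only approximate the observable effect of text() (collapse whitespace runs, then trim) and wholeText() (leave the text untouched), and are not jsoup's actual implementation, which may treat some whitespace characters differently from the `\s` regex class used here.

```java
import java.util.regex.Pattern;

// Illustrative stand-in, NOT jsoup code: approximates what text() and
// wholeText() do to whitespace so the effect can be seen in isolation.
final class WhitespaceDemo {
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Approximates text(): each whitespace run becomes one space, ends are trimmed.
    static String normalized(String raw) {
        return WHITESPACE_RUN.matcher(raw).replaceAll(" ").trim();
    }

    // Approximates wholeText(): whitespace is returned exactly as given.
    static String whole(String raw) {
        return raw;
    }

    public static void main(String[] args) {
        String raw = "ABC  123"; // two spaces, as in the question
        System.out.println("[" + normalized(raw) + "]"); // prints [ABC 123]
        System.out.println("[" + whole(raw) + "]");      // prints [ABC  123]
    }
}
```

This is why the first assertion in the question fails: the expected string keeps both spaces, while the normalized string has only one.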