java html jsoup

How to prevent Jsoup from erasing angle-brackets inside text when parsing


I am trying to extract only the text of an HTML document that contains angle brackets as part of the text.

For example, the HTML file looks something like this:

<html>
 <head></head> 
 <body> 
  <div>
    <p>1. <someUnicodeString></p> 
    <p>2. <foo 2012.12.26.></p> 
    <p>3. <123 2012.12.26.></p> 
    <p>4. <@ 2012.12.26.></p> 
    <p>5. foobarbar</p> 
  </div>
 </body>
</html>

I want the resulting text file to look like this:

1. <someUnicodeString> 
2. <foo 2012.12.26.> 
3. <123 2012.12.26.> 
4. <@ 2012.12.26.> 
5. foobarbar

I am using Jsoup's parse function to achieve this, as shown below:

Document doc = null;

try {
    doc = Jsoup.parse(new File(path), "UTF-8");
    doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
    doc.outputSettings().escapeMode(EscapeMode.xhtml);

    //set line breaks in readable format
    doc.select("br").append("\\n");
    doc.select("p").prepend("\\n\\n");
    String bodyText = doc.body().html().replaceAll("\\\\n", "\n");
    bodyText = Jsoup.clean(bodyText, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));

    File f = new File(textFileName+".txt");
    f.getParentFile().mkdirs();
    PrintWriter writer = new PrintWriter(f, "UTF-8");
    writer.print(Parser.unescapeEntities(bodyText, false));
    writer.close();
} catch(IOException e) {
    //Do something
    e.printStackTrace();
}

However, once Jsoup has parsed the document, every angle bracket that is followed by a letter is treated as the start of a tag, and Jsoup adds a matching closing tag for it:

<p>1. <someUnicodeString></someUnicodeString></p> 
<p>2. <foo 2012.12.26.></foo></p> 
<p>3. <123 2012.12.26.></p> 
<p>4. <@ 2012.12.26.></p> 
<p>5. foobarbar</p> 

This eventually produces the following output:

1.  
2.  
3. <123 2012.12.26.> 
4. <@ 2012.12.26.> 
5. foobarbar

How can I prevent Jsoup from erasing angle-brackets inside text when parsing?

Or is there a way to make Jsoup recognize that certain angle brackets are not HTML elements (perhaps using a regex)?

I am new to Jsoup and would very much appreciate any kind of help. Thank you.


Solution

  • Thanks to the comment by Davide Pastore and the question "Right angle bracket in HTML",
    I was able to solve the problem with the following code.

    doc = Jsoup.parse(new File(path), "UTF-8");
    // Inside each <p> element, replace every "<" that is followed by a lowercase letter with "&lt;"
    Elements pTags = doc.select("p");
    for (Element tag : pTags) {
        // Adjust the character class in the regex to whatever suits your input
        if (tag.html().matches("(.*)<[a-z](.*)")) {
            String innerHTML = tag.html().replaceAll("<(?=[a-z])", "&lt;");
            tag.html(innerHTML);
        }
    }
    

    If you convert each "<" that appears in text to "&lt;" before you start parsing, you will be able to get the right output.
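
The idea of escaping "<" before parsing can also be applied to the raw HTML string, so that Jsoup never sees the stray brackets as tags. The following is only a minimal sketch using plain java.util.regex, not part of the original answer. The class name AngleBracketEscaper, the method escapeStrayBrackets, and the KNOWN_TAGS allow-list are hypothetical names chosen for illustration, and the list assumes that the only real markup in the input is html, head, body, div, p, and br.

```java
import java.util.List;
import java.util.regex.Pattern;

class AngleBracketEscaper {

    // Hypothetical allow-list: tag names that should still be treated as markup.
    // Anything else that starts with '<' is assumed to be literal text.
    private static final List<String> KNOWN_TAGS =
            List.of("html", "head", "body", "div", "p", "br");

    // Matches a '<' that does NOT begin a comment/doctype ("<!") and does NOT
    // begin an opening or closing tag whose name is in KNOWN_TAGS.
    private static final Pattern STRAY_LT = Pattern.compile(
            "<(?!!|/?(?:" + String.join("|", KNOWN_TAGS) + ")(?=[\\s/>]))",
            Pattern.CASE_INSENSITIVE);

    // Returns the input with every stray '<' replaced by the entity "&lt;".
    public static String escapeStrayBrackets(String rawHtml) {
        return STRAY_LT.matcher(rawHtml).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        String raw = "<div><p>1. <someUnicodeString></p><p>3. <123 2012.12.26.></p></div>";
        // prints: <div><p>1. &lt;someUnicodeString></p><p>3. &lt;123 2012.12.26.></p></div>
        System.out.println(escapeStrayBrackets(raw));
    }
}
```

The escaped string could then be passed to Jsoup.parse(String) in place of the file. The closing ">" is left alone because a bare ">" in text is not treated as markup. The allow-list would need to be extended for documents that use more tags.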