
jsoup remove outer html tag - code HTML element


This seems pretty straightforward, but obviously I'm doing something wrong. I am trying to produce both standalone code tags and code tags nested under pre tags: the standalone ones should render as "one-line boxes with code inside", and the ones under pre tags as "big boxes with code inside". There are also empty paragraph tags that I can't get rid of using the standard approach (testing the paragraph for no text, then removing the element). Here's the input:

        <h1>Module Description and Learning Objectives</h1>  
        <p> 
        </p> 
        <pre>                
        <p>
        <code>2020-02-13 12:49:15 DEBUG StackTraceElement:48 -</code>
        </p>
        <p>
        <code>2020-02-13 12:49:15 DEBUG StackTraceElement:48 - sects.title</code>
        </p>
        <p>
        <code>2020-02-13 12:49:15 DEBUG StackTraceElement:48 - sects.id=1</code>
        </p>
        <p>
        </p>
        </pre> 
        <p>Sentence 1</p> 
        <p> 
        <code>System.out.println("id:"+element.attr("id"));</code> 
        </p> 
        <p>Sentence 2</p> 
        <p> 
        <code>System.out.println("src:"+element.attr("src"));</code> 
        </p> 
        <p>Sentence 3</p> 
        <p> 
        <code>System.out.println("alt:"+element.attr("alt"));</code> 
        </p> 
        <p> 
        </p> 

This is my code (pay attention to the constructs rather than the names, which are mid-refactor):

          Elements pWithCodeTagList = docXMLformat.select("code");
          if (pWithCodeTagList.size() > 0) {
              for (Element pTag : pWithCodeTagList) {
                   System.out.println("pTag=" + pTag.text() + " " + pTag.tagName());
                   pTag.unwrap();
              }
          }

Here is the output in Eclipse. I am indeed selecting the code tags, and I expected the parent p to disappear:

 pTag=2020-02-13 12:49:15 DEBUG StackTraceElement:48 - code
 pTag=2020-02-13 12:49:15 DEBUG StackTraceElement:48 - sects.title code
 pTag=2020-02-13 12:49:15 DEBUG StackTraceElement:48 - sects.id=1 code
 pTag=System.out.println("id:"+element.attr("id")); code
 pTag=System.out.println("src:"+element.attr("src")); code
 pTag=System.out.println("alt:"+element.attr("alt")); code

This is the result: I expected the paragraph tags to disappear, not the code tags!

   <h1>Module Description and Learning Objectives</h1> 
                <p> 
                </p> 
                <pre>                
                    <p>
                    2020-02-13 12:49:15 DEBUG StackTraceElement:48 -
                </p>
                    <p>
                    2020-02-13 12:49:15 DEBUG StackTraceElement:48 - sects.title
                </p>
                    <p>
                    2020-02-13 12:49:15 DEBUG StackTraceElement:48 - sects.id=1
                </p>
                    <p>
                    </p>
                </pre> 
                <p>Sentence 1</p> 
                <p> System.out.println("id:"+element.attr("id")); </p> 
                <p>Sentence 2</p> 
                <p> System.out.println("src:"+element.attr("src")); </p> 
                <p>Sentence 3</p> 
                <p> System.out.println("alt:"+element.attr("alt")); </p> 
                <p> 
                </p> 

I have already touched this area of the document: before this step I removed a span tag around each code tag and had to strip all the line-control characters from the content. Maybe PRE and CODE don't work like other tags (I know they are not supposed to), but ... Also, I'm trying to keep the tags and content on the same line so my "code boxes" are as slim as possible, to wit:

 <pre>                
 <code>2020-02-13 12:49:15 DEBUG StackTraceElement:48 -</code>
 <code>2020-02-13 12:49:15 DEBUG StackTraceElement:48 - sects.title</code>
 <code>2020-02-13 12:49:15 DEBUG StackTraceElement:48 - sects.id=1</code>
  </pre> 
  <p>Sentence 1</p> 
  <code>System.out.println("id:"+element.attr("id"));</code> 
  <p>Sentence 2</p> 
  <code>System.out.println("src:"+element.attr("src"));</code> 
  <p>Sentence 3</p> 
  <code>System.out.println("alt:"+element.attr("alt"));</code> 

Solution

  • Your selector is selecting the code elements, not the p elements, and unwrap() removes the element it is called on while keeping its children. That is why the code tags are the ones being removed. You should select the p elements that contain a code tag and unwrap() those: p:has(code)

    Also, you don't need to iterate over them and call unwrap() on each one if you want to unwrap them all (unless you need extra logic for each one). You can just call Elements#unwrap()

    Elements pWithCodeTagList = docXMLformat.select("p:has(code)");
    pWithCodeTagList.unwrap();
    

    And to find the empty p tags, you could use the :matches selector, which runs a regex over the element's text, and look for a single whitespace character or nothing: p:matches(^\s?$)

    Elements emptyPs = docXMLformat.select("p:matches(^\\s?$)");
    emptyPs.remove();
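
Putting the two steps together, here is a minimal sketch, assuming jsoup is on the classpath. The class name UnwrapCodeParagraphs, the clean helper, and the shortened sample HTML are made up for illustration and are not from the question. It also swaps in ^\s*$ for ^\s?$ so that any run of whitespace counts as empty.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class UnwrapCodeParagraphs {

    // Hypothetical helper: unwraps every <p> that contains a <code>,
    // then removes every <p> whose text is empty or whitespace only.
    static Document clean(String html) {
        Document doc = Jsoup.parseBodyFragment(html);

        // Select the paragraphs (not the code elements) and unwrap them,
        // so the <p> goes away and its <code> child stays in place.
        doc.select("p:has(code)").unwrap();

        // Assumption: \s* rather than the answer's \s? to tolerate
        // any amount of whitespace inside the paragraph.
        doc.select("p:matches(^\\s*$)").remove();

        return doc;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the input in the question.
        String html = "<p> </p>"
                + "<pre><p><code>log line 1</code></p><p></p></pre>"
                + "<p>Sentence 1</p>"
                + "<p><code>System.out.println(1);</code></p>";

        Document doc = clean(html);
        System.out.println(doc.body().html());
    }
}
```

With this input, the only paragraph left should be the "Sentence 1" one, with both code elements kept. Note that the emptiness check only looks at text, so a p that holds nothing but, say, an img would be removed as well. Narrow the selector if that matters for your document.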