
Deleting parent tags without deleting children with Jsoup


Sample HTML that needs to be cleaned up:

       <div class="mrd3w m6et0 _2d49e_1O4vF"> 
        <div class="p1td4 pw4go p513t al2kje m10qy mij5n"> 
         <div class="_2d49e_2tor6" style="max-width:871px;max-height:552px"> 
          <div class="ptv8j2" style="padding-top:calc(100% * 552 / 871)">
           <img alt="alt" class="_2d49e_3B1Cq pt94f9 pt1itw ptux49 w1eai _2d49e_32cUf lazyloaded" sizes="(min-width: 1200px) 560px, (min-width: 992px) 50vw, 100vw" src="https://somelink.com 871w" width="871px">
          </div> 
         </div> 
        </div> 
       </div> 

I have already deleted some useless links and imports from this HTML, and this is my last problem. The div classes are random, and there are a lot of them.

I need to get simple clean code like this:

<div>
  <img alt="alt" src="https://somelink.com">
</div>

I am creating an XML file from a database, and the description of each product is a mess that needs to be as clean as possible. The whole description is stored in the database as a single value, including all of these messy imports and tags. I am using Jsoup to rebuild the description, but I have no clue how to delete parents without deleting their children.


Solution

  • This requires two steps:

    1. To strip unwanted tags and attributes, use a Whitelist (renamed Safelist in newer jsoup releases) with Jsoup.clean(html, whitelist).
    2. To remove a parent while keeping its children, use element.unwrap(). To collapse repeated parents, walk up from the img in a loop and unwrap each parent whose own parent has the same tag name.
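    Before the full program, here is a minimal sketch of unwrap() on its own. The class name UnwrapSketch, the helper unwrapInner and the a.png input are made up for illustration; it uses Jsoup.parseBodyFragment and returns only the body's inner HTML, which is one possible way to keep the wrapper tags out of the result.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class UnwrapSketch {

    // Hypothetical helper: removes the first <div> that is a direct child
    // of another <div>, keeping that inner div's children in place.
    static String unwrapInner(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        Element inner = doc.selectFirst("div > div");
        if (inner != null) {
            // unwrap() deletes the element itself and moves its children
            // up into its former parent.
            inner.unwrap();
        }
        // Return only what is inside <body>, not the whole document.
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(unwrapInner("<div><div><img src=\"a.png\"></div></div>"));
    }
}
```

    Only the inner div is removed here; the img ends up as a direct child of the outer div.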

    Here is the complete code for both steps:

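    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.safety.Whitelist;

    public class JsoupIssue61137870 {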
    
        public static void main(final String[] args) throws IOException {
            String html = "  <div class=\"mrd3w m6et0 _2d49e_1O4vF\"> \n"
                    + "        <div class=\"p1td4 pw4go p513t al2kje m10qy mij5n\"> \n"
                    + "         <div class=\"_2d49e_2tor6\" style=\"max-width:871px;max-height:552px\"> \n"
                    + "          <div class=\"ptv8j2\" style=\"padding-top:calc(100% * 552 / 871)\">\n"
                    + "           <img alt=\"alt\" class=\"_2d49e_3B1Cq pt94f9 pt1itw ptux49 w1eai _2d49e_32cUf lazyloaded\" sizes=\"(min-width: 1200px) 560px, (min-width: 992px) 50vw, 100vw\" src=\"https://somelink.com 871w\" width=\"871px\">\n"
                    + "          </div> \n" + "         </div> \n" + "        </div> \n" + "       </div> ";
    
            Whitelist whitelist = Whitelist.none();
            whitelist.addTags("div", "img");
            // keep only alt and src on <img>; every other attribute is dropped
            whitelist.addAttributes("img", "alt", "src");
            String cleanHTML = Jsoup.clean(html, whitelist);
            System.out.println(cleanHTML);
    
            String result = removeRepeatingTags(cleanHTML);
            System.out.println(result);
        }
    
        private static String removeRepeatingTags(String html) {
            Document doc = Jsoup.parse(html);
            Element img = doc.selectFirst("img");
            Element parent = img.parent();
            // unwrap the img's parent for as long as the grandparent has the same tag name
            while (parent.tagName().equals(parent.parent().tagName())) {
                parent.unwrap();
                parent = img.parent();
            }
            return doc.toString();
        }
    }
    

    The output of the first part is:

    <div> 
     <div> 
      <div> 
       <div> 
        <img alt="alt" src="https://somelink.com 871w"> 
       </div> 
      </div> 
     </div> 
    </div>
    

    and the output after the second part is:

    <html>
     <head></head>
     <body>
      <div>    
        <img alt="alt" src="https://somelink.com 871w">  
      </div>
     </body>
    </html>
    

    Jsoup.parse adds <html>, <head> and <body> tags. To avoid this, instead of

        Document doc = Jsoup.parse(html);
    

    use

        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    

    and the output will have the structure you expect (the src value still carries the " 871w" suffix from the input):

    <div>    
     <img alt="alt" src="https://somelink.com 871w">    
    </div>
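
    Putting both steps together, a hedged end-to-end sketch might look like the following. It is not the original answer's code: the class and method names are invented, it assumes a jsoup release that ships Safelist (with an older release, swap in Whitelist), it handles every img rather than only the first, and it returns doc.body().html() as an alternative to the XML parser for avoiding the wrapper tags. It also assumes that a space inside src marks the start of a width descriptor such as " 871w", which does not hold if the URLs can legitimately contain spaces.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class DescriptionCleanerSketch {

    // Step 1: keep only <div> and <img>, and only alt/src on <img>.
    static String stripMarkup(String html) {
        Safelist safelist = Safelist.none()
                .addTags("div", "img")
                .addAttributes("img", "alt", "src");
        return Jsoup.clean(html, safelist);
    }

    // Step 2: collapse same-named wrappers around each <img>.
    static String collapseWrappers(String cleanedHtml) {
        Document doc = Jsoup.parseBodyFragment(cleanedHtml);
        for (Element img : doc.select("img")) {
            Element parent = img.parent();
            // Unwrap the direct parent while its own parent has the same tag name.
            while (parent != null && parent.parent() != null
                    && parent.tagName().equals(parent.parent().tagName())) {
                parent.unwrap();
                parent = img.parent();
            }
            // Assumption: a space in src starts a width descriptor like " 871w".
            String src = img.attr("src").trim();
            int space = src.indexOf(' ');
            if (space > 0) {
                img.attr("src", src.substring(0, space));
            }
        }
        // Inner HTML of <body>, so no <html>/<head>/<body> tags are returned.
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div class=\"a\"><div class=\"b\" style=\"max-width:871px\">"
                + "<img alt=\"alt\" class=\"c\" src=\"https://somelink.com 871w\" width=\"871px\">"
                + "</div></div>";
        System.out.println(collapseWrappers(stripMarkup(html)));
    }
}
```

    The null checks in the loop are there so that a description without nested wrappers, or without any img at all, passes through without an exception.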