
Jsoup doesn't detect closing </div> tags in the list


I have an HTML string with a <div style="position: relative;"> </div> that contains a list of ~200 elements, like this:

 <div style="position: relative;">
    <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">1. Some Text</div>
    <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">2. Some Text</div>
    <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;"data-id="0_0_0_0">3. Some Text</div>
    ...
    <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;"data-id="0_0_0_0">200. Some Text</div>
 </div>

I do

Document document = Jsoup.parse(html);

I expect to get a document with a list, like this:

<html>
 <head></head>
 <body>
   <div style="position: relative;">
      <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">1. Some Text</div>
      <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">2. Some Text</div>
      <div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">3. Some Text</div>
      ...
   </div>
 </body>
</html>

But Jsoup doesn't recognize the closing tags of the list elements and creates a document with dozens of nested divs instead of a list, like this:

<html>
 <head></head>
 <body>
  <div style="\&quot;position:" relative;\>
   <div class="\&quot;episode-name\&quot;" style="\&quot;position:" absolute; top: 5px;left: 10px;font-size: 30px;\ data-id="\&quot;0_0_0_0\&quot;">1. Some Unicode&lt;\/div&gt; <-- ORIGINAL </div> tag (is it recognized as text?)
    <div class="\&quot;episode-name\&quot;" style="\&quot;position:" absolute; top: 5px;left: 10px;font-size: 30px;\ data-id="\&quot;0_0_0_0\&quot;">2. Some Unicode&lt;\/div&gt;
     <div class="\&quot;episode-name\&quot;" style="\&quot;position:" absolute; top: 5px;left: 10px;font-size: 30px;\ data-id="\&quot;0_0_0_0\&quot;">3. Some Unicode&lt;\/div&gt;
       </div> <-- the same </div>, generated by Jsoup at the end of the document instead of at the item's end
     </div>
   </div>
  </div>
 </body>
</html>

This messes up the DOM and makes parsing extremely difficult.
How can I get jsoup to parse this fragment correctly?


Solution

  • When I composed the string manually and passed it to jsoup, I found the root of the problem.
    In the HTML string that the network request returns, the opening <div> tags are normal, but the closing tags of the list items are escaped: the items contain <\/div> instead of </div>.
    The list items look like this:

    <div class=\"episode-name\" style=\"position: absolute; top: 5px;left: 10px;font-size: 30px;\" data-id=\"0_0_0_0\">1. Text<\/div> <--ESCAPED TAG
    

    instead of this:

    <div class=\"episode-name\" style=\"position: absolute; top: 5px;left: 10px;font-size: 30px;\" data-id=\"0_0_0_0\">1. Text</div> <-- NORMAL TAG
    

    So Jsoup interprets those escaped closing tags as text and generates an incorrect DOM structure.

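    SOLUTION:
    Checking for escaped tags before passing the HTML to jsoup and replacing them with unescaped ones solved the problem. In my example I just use

    // String.replace matches literally; with replaceAll, the regex <\/div> would match only the already unescaped </div>
    html = html.replace("<\\/div>", "</div>");
    document = Jsoup.parse(html);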
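
    The escaped list items shown above also carry \" around attribute values, which suggests the whole string is still JSON-escaped. Here is a minimal sketch of a slightly more general cleanup, assuming the only escapes present are \" and \/. The class name EscapedHtmlFixer and the method unescape are hypothetical names chosen for illustration, and the sample string is made up to mirror the list items from the question.

```java
public class EscapedHtmlFixer {

    // Undo the two JSON-style escapes seen in the question: \" and \/.
    // String.replace treats both arguments as literal text, so the
    // backslash needs no regex escaping here.
    public static String unescape(String html) {
        return html
                .replace("\\\"", "\"")  // \"  ->  "
                .replace("\\/", "/");   // \/  ->  /
    }

    public static void main(String[] args) {
        // Hypothetical input: one list item as it might arrive from the network.
        String raw = "<div class=\\\"episode-name\\\" data-id=\\\"0_0_0_0\\\">1. Some Text<\\/div>";
        System.out.println(unescape(raw));
    }
}
```

    The cleaned string can then be passed to Jsoup.parse as before. This sketch does not handle other escapes such as \\ or \n; if the response is really a JSON document, decoding it with a JSON library before handing the HTML to jsoup is likely the more robust route.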