I have an HTML string with a <div style="position: relative;"> </div>
that contains a list of ~200 elements, like this:
<div style="position: relative;">
<div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">1. Some Text</div>
<div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">2. Some Text</div>
<div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">3. Some Text</div>
...
<div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">200. Some Text</div>
</div>
I parse it with
Document document = Jsoup.parse(html);
I expect to get a document with a list, like this:
<html>
<head></head>
<body>
<div style="position: relative;">
<div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">1. Some Text</div>
<div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">2. Some Text</div>
<div class="episode-name" style="position: absolute; top: 5px;left: 10px;font-size: 30px;" data-id="0_0_0_0">3. Some Text</div>
</div>
</body>
</html>
But Jsoup doesn't recognize the closing tags of the list elements and creates a document with dozens of nested divs instead of a list, like this:
<html>
<head></head>
<body>
<div style="\"position:" relative;\>
<div class="\"episode-name\"" style="\"position:" absolute; top: 5px;left: 10px;font-size: 30px;\ data-id="\"0_0_0_0\"">1. Some Unicode<\/div> <-- the ORIGINAL </div> tag (is it treated as text?)
<div class="\"episode-name\"" style="\"position:" absolute; top: 5px;left: 10px;font-size: 30px;\ data-id="\"0_0_0_0\"">2. Some Unicode<\/div>
<div class="\"episode-name\"" style="\"position:" absolute; top: 5px;left: 10px;font-size: 30px;\ data-id="\"0_0_0_0\"">3. Some Unicode<\/div>
</div> <-- the same </div>, generated by Jsoup at the end of the document instead of at the end of the item
</div>
</div>
</div>
</body>
</html>
This messes up the DOM and makes parsing extremely difficult.
How can I get Jsoup to parse this fragment correctly?
When I composed the string manually and passed it to Jsoup, I found the root of the problem.
In the HTML string that the network request returns, the opening <div> tags are normal, but the closing </div> tags of the list items are escaped: the list items contain <\/div> instead of </div>.
The list items look like this:
<div class=\"episode-name\" style=\"position: absolute; top: 5px;left: 10px;font-size: 30px;\" data-id=\"0_0_0_0\">1. Text<\/div> <--ESCAPED TAG
instead of this:
<div class=\"episode-name\" style=\"position: absolute; top: 5px;left: 10px;font-size: 30px;\" data-id=\"0_0_0_0\">1. Text</div> <-- NORMAL TAG
So Jsoup interprets those escaped closing tags as text and generates an incorrect DOM structure: no list item is ever closed, so each one ends up nested inside the previous one.
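A quick way to confirm this diagnosis before parsing is to check whether the raw string contains backslash-escaped closing tags. The following is only a minimal sketch: the class and method names (EscapedTagCheck, hasEscapedClosingTags) are hypothetical, and it assumes the escaping always has the form <\/tagname>.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup.
class EscapedTagCheck {
    // Regex: '<', one literal backslash, '/', a tag name, optional whitespace, '>'.
    // In the Java literal, "\\\\" becomes the regex \\ which matches a single backslash.
    private static final Pattern ESCAPED_CLOSING_TAG =
            Pattern.compile("<\\\\/[A-Za-z][A-Za-z0-9]*\\s*>");

    // Returns true if the string contains something like <\/div>.
    static boolean hasEscapedClosingTags(String html) {
        return ESCAPED_CLOSING_TAG.matcher(html).find();
    }
}
```

If EscapedTagCheck.hasEscapedClosingTags(html) returns true for the network response, the string needs to be unescaped before it is handed to Jsoup.parse.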
SOLUTION:
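Checking for escaped tags before passing the HTML to Jsoup and replacing them with unescaped ones solved the problem. Note that String.replace matches its argument literally, whereas replaceAll treats it as a regex in which \/ simply means /, so replaceAll("<\\/div>", "</div>") would leave the string unchanged. In my example I just use
html = html.replace("<\\/div>", "</div>");
Document document = Jsoup.parse(html);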
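The sample above also shows escaped quotes (\"), which suggests the response is a JSON-style escaped string; that is an assumption based on the sample, not something the response confirms. Under that assumption, a slightly more general sketch undoes both escapes with literal String.replace calls. The names HtmlUnescape and unescapeJsonStyle are hypothetical, and the helper deliberately ignores other escapes such as \\ or \n.

```java
// Hypothetical helper: undoes the two escapes seen in the sample payload.
class HtmlUnescape {
    static String unescapeJsonStyle(String html) {
        return html
                // String.replace is literal (no regex): backslash + slash -> slash
                .replace("\\/", "/")
                // backslash + double quote -> double quote
                .replace("\\\"", "\"");
    }
}
```

Usage would then be Document document = Jsoup.parse(HtmlUnescape.unescapeJsonStyle(html)). If the payload really is part of a JSON response, decoding it with a JSON parser is probably the more robust route, since that handles every escape sequence at once.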