JSOUP - Accessing elements within a div class / stop when reaching a specific div class


I'm trying to parse data from HTML. I need to extract specific content from the HTML, but the ordering of the elements may differ from page to page.

<h1>Latest Deals</h1>\r\n </div>\r\n </div>\r\n</div>\r\n\r\n
<div class=\"breadcrumb-wrapper\">\r\n    
<ul class=\"breadcrumb\">\r\n        
<li><a href=\"/Home\">Home</a></li>\r\n        
<li><a href=\"/Deals\">Deals</a></li>\r\n        
<li class=\"active\">Mau Mudik Hemat? Nikmati Diskon Hingga 20%</li>\r\n 
</ul>\r\n</div>\r\n\r\n
<div class=\"article outer clearfix\">\r\n    
<div class=\"col-sm-12\">\r\n        
<img alt=\"Mau Mudik Hemat? Nikmati Diskon Hingga 20%\" title=\"Mau Mudik Hemat? Nikmati Diskon Hingga 20%\" src=\"/images/slider/id/special-raya-offer-id-v2.jpg\">\r\n        
<h1>Mau Mudik Hemat? Nikmati Diskon Hingga 20%</h1>\r\n        
<p class=\"date\">May 18th, 2018</p>\r\n        
<p><strong class=\"text-red\"></strong></p>\r\n\r\n        
<p>This is the first paragraph</p>\r\n\r\n        
<p>This is the second paragraph.</p>\r\n\r\n        
<p>This is the third paragraph</p>\r\n\r\n        
<p>Below is the point form start:</p>\r\n\r\n        
<ol>\r\n            
<li>Point form A</li>\r\n            
<li>Point form B</li>\r\n            
<li>Point form C</li>\r\n            
<li>Point form D</li>\r\n            
</ol>\r\n\r\n\r\n\r\n        
<div class=\"m-top30 m-bottom20\">\r\n    
<a href=\"/home\" class=\"btn btn-lg btn-orange\">Home</a>\r\n\r\n    \r\n\r\n\r\n</div>\r\n\r\n\r\n

Previously, I successfully got the content I wanted with:

Document doc = Jsoup.parse(content);
Element eTitle = doc.getElementsByTag("h1").get(1);
Elements eBody = doc.getElementsByTag("p");

for (Element body : eBody) {
   detailContent += "<p>" + body.html() + "</p>";
}

The code above gets the article's "h1" (index 1, the second one in the document) and every "p" element from my long HTML. However, in some cases there may now be "ol" elements in between those "p" elements. For example:

<div class=\"col-sm-12\">\r\n <img alt=\"abc\" title=\"abcd\" src=\"/images/slider/id/abcd.jpg\">\r\n 
<h1>This is the header</h1>\r\n
<p class=\"date\">November 4th, 2015</p>\r\n 
<p><strong class=\"text-red\">Sorry, this promotion has expired.</strong></p>\r\n  
<p> Paragraph 1 </p>\r\n
<p> Paragraph 2 </p>\r\n
<ol>\r\n            
<li> Point 1 </li>\r\n            
<li> Point 2 </li>\r\n            
</ol>\r\n
<p> Paragraph 3 </p>\r\n
<p> Paragraph 4 </p>\r\n
<ol>\r\n            
<li> Point 1 </li>\r\n            
<li> Point 2 </li>\r\n            
</ol>\r\n
<div class=\"m-top30 m-bottom20\">

How should I write my code to get all of these items?
P.S. All I want to do is:
1) Get the elements in the "col-sm-12" div, up to the last element before "m-top30 m-bottom20"
2) Ignore certain elements contained in "col-sm-12"


Solution

  • Switching to CSS selectors and restricting the match to direct children of the first div (for example 'p') should help. However, from the HTML above it is not clear whether the first div ends before the second div starts. If you share more of the HTML, we may be able to refine the selectors. I have stated my assumptions in the code comments.

        String eTitle = doc.select("div.col-sm-12 > h1").text(); //I'm assuming you are trying to fetch the title text. 
    
        Elements eBody = doc.select("div.col-sm-12 > p, div.col-sm-12 > ol"); //Each alternative needs the div prefix; a bare ', ol' would match every 'ol' in the document.
    
        for (Element body : eBody) {
          //work with the 'body' element here.
        }
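
If the "m-top30 m-bottom20" div turns out to be a child of "col-sm-12", an alternative is to walk the container's direct children and stop at that div. Below is a minimal sketch of that idea, assuming jsoup 1.11 or later is on the classpath. The class name `ArticleExtractor`, the method name `extractBody`, and the decision to skip `p.date` are illustrative assumptions, not something taken from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class ArticleExtractor {

    // Returns the <p> and <ol> direct children of the first div.col-sm-12,
    // in document order, stopping at the first div.m-top30.m-bottom20 child.
    static List<Element> extractBody(String html) {
        List<Element> result = new ArrayList<>();
        Document doc = Jsoup.parse(html);
        Element container = doc.selectFirst("div.col-sm-12");
        if (container == null) {
            return result; // no matching container in this page
        }
        for (Element child : container.children()) {
            if (child.is("div.m-top30.m-bottom20")) {
                break; // requirement 1: stop at the button wrapper
            }
            if (child.is("p.date")) {
                continue; // requirement 2 (assumed): ignore the date line
            }
            if (child.is("p, ol")) {
                result.add(child);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div class=\"col-sm-12\"><h1>Header</h1>"
                + "<p class=\"date\">November 4th, 2015</p>"
                + "<p>Paragraph 1</p><ol><li>Point 1</li></ol>"
                + "<div class=\"m-top30 m-bottom20\"><a href=\"/home\">Home</a></div></div>";
        StringBuilder detailContent = new StringBuilder();
        for (Element body : extractBody(html)) {
            detailContent.append(body.outerHtml()); // keeps the <p>/<ol> tags
        }
        System.out.println(detailContent);
    }
}
```

If the wrapper div is a sibling of "col-sm-12" instead, the loop simply runs to the end of the container, so the same sketch should still apply. Add further `child.is(...)` checks to skip other elements you do not want.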