
How to get text between two Elements in DOM object?


I'm using JSoup to parse this HTML content:

<div class="submitted">
    <strong><a title="View user profile." href="/user/1">user1</a></strong> 
    on 27/09/2011 - 15:17 
    <span class="via"><a href="/goto/002">www.google.com</a></span>
</div> 

which looks like this in a web browser:

user1 on 27/09/2011 - 15:17 www.google.com

The username and the website can be parsed into variables using this:

String user    = content.getElementsByClass("submitted").first().getElementsByTag("strong").first().text(); 
String website = content.getElementsByClass("submitted").first().getElementsByClass("via").first().text();

But I'm unsure how to get the "on 27/09/2011 - 15:17" part into a variable. If I use

String date = content.getElementsByClass("submitted").first().text();

the result also contains the username and the website.


Solution

  • One option is to remove the user and website elements and read the text that is left. Clone the submitted element first if you do not want the removals to alter your document:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // wrapper class added so the snippet compiles on its own; the name is arbitrary
    public class Main {

        public static void main(String[] args) throws Exception {

            Document content = Jsoup.parse(
              "<div class=\"submitted\">" +
              "  <strong><a title=\"View user profile.\" href=\"/user/1\">user1</a></strong>" +
              "  on 27/09/2011 - 15:17 " +
              "  <span class=\"via\"><a href=\"/goto/002\">www.google.com</a></span>" +
              "</div> ");

            // work on a clone of the element so the original document is not modified
            Element submitted = content.getElementsByClass("submitted").first().clone();

            // remove the child elements whose text should not appear in the result
            submitted.getElementsByTag("strong").remove();
            submitted.getElementsByClass("via").remove();

            // text() now returns only what is left: the date between the two elements
            System.out.println(submitted.text());
        }
    }
    

    Outputs:

    on 27/09/2011 - 15:17
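
Another way to look at it: the date is the only text that sits directly inside the div, not inside one of its child elements. So instead of removing children, you can collect just the div's own text nodes. jsoup has `Element.ownText()` for this. The sketch below is not jsoup code; it shows the same idea with the JDK's built-in XML DOM parser so it runs without extra libraries. It assumes the markup is well-formed XML (true for the snippet in the question, not for HTML in general). The class and method names (`OwnTextSketch`, `ownText`, `ownTextOfRoot`) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class OwnTextSketch {

    // Concatenates only the text nodes that are direct children of the element,
    // skipping text nested inside child elements such as <strong> or <span>.
    public static String ownText(Element element) {
        StringBuilder sb = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        // collapse runs of whitespace into single spaces and trim both ends
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    // Parses a well-formed XML/XHTML fragment and returns the root element's own text.
    // Assumption: the input has a single root element and is valid XML.
    public static String ownTextOfRoot(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return ownText(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"submitted\">" +
            "  <strong><a title=\"View user profile.\" href=\"/user/1\">user1</a></strong>" +
            "  on 27/09/2011 - 15:17 " +
            "  <span class=\"via\"><a href=\"/goto/002\">www.google.com</a></span>" +
            "</div>";

        System.out.println(ownTextOfRoot(html));
    }
}
```

This leaves the document untouched, so no clone is needed. The trade-off is that it picks up every piece of text directly inside the div, so if the markup later gains more loose text next to the date, you would get that too.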